Algorithms for Estimation of Three-Dimensional
Motion 


By A. N. NETRAVALI and J. SALZ* 
(Manuscript received March 1, 1984) 


We derive robust algorithms for estimating parameters of motion of rigid 
bodies that are observed by a television camera. Motion may be three- 
dimensional, containing both translational and rotational components, but 
the observations using the television camera are two-dimensional, i.e., projec- 
tions on the camera plane. Our algorithms do not require a priori knowledge 
of any corresponding points in three- and two-dimensional spaces. We give 
both recursive and nonrecursive algorithms that minimize the error in
intensity by using the estimated motion parameters. Our theory has applica- 
tions in interframe coding, computer vision, and computer animation. The 
efficacy of our methods and the quality of the estimation procedures must 
await experimental verification. 


I. INTRODUCTION 


One of the most important problems in machine analysis of image 
sequences captured by a television camera is estimating the motion of 
objects in the field of view.¹ We have previously given algorithms for
estimating the displacement vector when the motion is restricted to
translation in a plane perpendicular to the camera axis.²,³ This was
later extended to situations where the illumination in the scene is
spatially nonuniform⁴ and to computationally more complex algorithms
with better properties.⁵ In this paper, we propose a further
extension by developing algorithms for estimating parameters of three- 


* Authors are employees of AT&T Bell Laboratories. 
dimensional motion. The motion thus may have both translational
and rotational components, and the translation may not be in a
plane perpendicular to the camera axis. Of course, although the object 
is three-dimensional and it moves in a three-dimensional space, the 
observations made by the television camera are still in a two-dimen- 
sional space, i.e., the object is observed by being projected from the 
object space to the image plane. Thus, information is lost in going 
from the three-dimensional object space to the two-dimensional image 
plane; this is a major source of difficulty in such estimation problems. 
It leads to nonunique solutions or ambiguous situations, unless addi- 
tional information is made available. One such example of additional 
information is the correspondence of points in two- and three-dimensional
space. This example is often used to determine camera position⁶
or to make motion estimation unique.⁷,⁸ However, in many practical
problems such correspondence is either difficult or impossible to 
establish. 

Our contribution in this paper is twofold. First, we develop equations 
of motion by noting the fact that a television camera creates a frame 
every thirtieth of a second. Most rigid body motion, in such a small 
amount of time, tends to be small. Therefore we develop models of 
incremental motion that each use three parameters for translation 
and rotation. Our second contribution is to give robust recursive and 
nonrecursive algorithms for estimating these parameters. The algo- 
rithms minimize the error in observed intensity by using these esti- 
mated motion parameters. Also, since the estimation algorithm is 
based on linearizing the intensity function, it is applicable in situations 
where the motion parameters are small. We also give an extension 
based on successive linearization that will work even when the motion 
is substantially large. 

Some of the limitations of our approach should be pointed out. 
First, we are considering rigid body motion, i.e., no deformation of the 
body is allowed as a function of time. Second, parameters are estimated 
to minimize the intensity estimation error, and therefore, they may 
not correspond exactly to the true motion parameters, particularly 
since the problem may not have a unique solution due to loss of 
information in transforming from three-dimensionality to two-dimen- 
sionality. However, we believe that in most reasonable cases, param- 
eters estimated by our procedure will be those corresponding to motion. 
Third, traditional difficulties with dynamic scene analysis, such as 
occlusion, spatial nonuniformity of motion parameters and illumina- 
tion, and lack of proper segmentation, are largely ignored at this stage. 
They will be considered in our future work. Last, and perhaps most 
important, we have no simulation results to evaluate the performance 
of our algorithms. We hope, however, that since motion estimation is 
important in such diverse fields as computer animation, computer
vision, and interframe coding, these algorithms will be specialized to 
many of these applications and then evaluated. 


II. MOTION MODEL


In this section, we develop a model of three-dimensional motion 
that includes translation and rotation. The only constraint we impose 
is that the body in motion stay rigid. Let us assume that the location 
of different points changes in the object space as a result of object 
motion and that only a two-dimensional projection on the image plane 
is observable using a camera. (See Fig. 1.) Let a point P designated by 
a vector r = col. (x, y, z) move to another point P’ designated by 
vector r’ = col. (x’, y’, z’). Since the body stays rigid,


\[
r' = Rr + T, \tag{1}
\]


where R is a three-by-three rotation matrix and T is a three-dimensional
translation vector. R can be represented in terms of the Eulerian
angles $\phi$, $\theta$, and $\psi$ as a product of three matrices, each
corresponding to rotation about one axis. Thus

\[
R = ABC, \tag{2}
\]

where

\[
A = \begin{bmatrix} \cos\phi & \sin\phi & 0 \\ -\sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{3}
\]

\[
B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & \sin\theta \\ 0 & -\sin\theta & \cos\theta \end{bmatrix} \tag{4}
\]

\[
C = \begin{bmatrix} \cos\psi & 0 & -\sin\psi \\ 0 & 1 & 0 \\ \sin\psi & 0 & \cos\psi \end{bmatrix}. \tag{5}
\]

Fig. 1—Coordinate system showing object space and image plane.


These equations then specify general rigid body transformations. In 
practice, a television camera observes a new scene every thirtieth of a 
second. During such a small time, changes in the parameters of motion 
(i.e., $\phi$, $\theta$, $\psi$, and T) will be small. We therefore specialize these
equations to small or infinitesimal changes in motion parameters that 
have taken place within a frame time. 


2.1 Infinitesimal motion 


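For infinitesimal motion, changes in Euler angles are small. If
these are denoted by $\Delta\phi$, $\Delta\theta$, and $\Delta\psi$, then if we use approximations,
$\cos\Delta\theta \approx 1$ and $\sin\Delta\theta \approx \Delta\theta$, eqs. (3), (4), and (5) become

\[
A = \begin{bmatrix} 1 & \Delta\phi & 0 \\ -\Delta\phi & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{6}
\]

\[
B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & \Delta\theta \\ 0 & -\Delta\theta & 1 \end{bmatrix} \tag{7}
\]

\[
C = \begin{bmatrix} 1 & 0 & -\Delta\psi \\ 0 & 1 & 0 \\ \Delta\psi & 0 & 1 \end{bmatrix}. \tag{8}
\]

Therefore,

\[
R = ABC \approx \begin{bmatrix} 1 & \Delta\phi & -\Delta\psi \\ -\Delta\phi & 1 & \Delta\theta \\ \Delta\psi & -\Delta\theta & 1 \end{bmatrix}
= I + \begin{bmatrix} 0 & \omega_z & -\omega_y \\ -\omega_z & 0 & \omega_x \\ \omega_y & -\omega_x & 0 \end{bmatrix} \Delta t, \tag{9}
\]

where I is the identity matrix; and $\omega_x$, $\omega_y$, and $\omega_z$ are angular velocities
about the x, y, and z axes, respectively. By substituting (9) into (1),
we get

\[
r' = r_{t+\Delta t} = r_t + P\,r_t\,\Delta t + T\,\Delta t, \tag{10}
\]

where

\[
P = \begin{bmatrix} 0 & \omega_z & -\omega_y \\ -\omega_z & 0 & \omega_x \\ \omega_y & -\omega_x & 0 \end{bmatrix} \tag{11}
\]

and

\[
T = \begin{bmatrix} v_x \\ v_y \\ v_z \end{bmatrix}, \quad \text{vector of translational velocities.} \tag{12}
\]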
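The size of the error made by the small-angle form (9) can be checked numerically. The following is a minimal sketch, not part of the original derivation: it builds R = ABC from eqs. (3) through (5) and compares it with the first-order matrix of eq. (9). The function names are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation(phi, theta, psi):
    """Exact R = A B C built from eqs. (3), (4), and (5)."""
    A = [[math.cos(phi), math.sin(phi), 0.0],
         [-math.sin(phi), math.cos(phi), 0.0],
         [0.0, 0.0, 1.0]]
    B = [[1.0, 0.0, 0.0],
         [0.0, math.cos(theta), math.sin(theta)],
         [0.0, -math.sin(theta), math.cos(theta)]]
    C = [[math.cos(psi), 0.0, -math.sin(psi)],
         [0.0, 1.0, 0.0],
         [math.sin(psi), 0.0, math.cos(psi)]]
    return matmul(matmul(A, B), C)

def small_angle_rotation(dphi, dtheta, dpsi):
    """First-order matrix of eq. (9); products of small angles are dropped."""
    return [[1.0, dphi, -dpsi],
            [-dphi, 1.0, dtheta],
            [dpsi, -dtheta, 1.0]]

def max_abs_diff(a, b):
    """Largest elementwise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))

if __name__ == "__main__":
    for angle in (1e-3, 1e-2, 1e-1):
        err = max_abs_diff(rotation(angle, angle, angle),
                           small_angle_rotation(angle, angle, angle))
        print(f"angle = {angle:g} rad, max error = {err:.3e}")
```

The error grows roughly with the square of the angle, which is consistent with restricting the model to the small changes that occur within one frame time.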


Next we denote the coordinates in the image plane by (X, Y). The 
transformation from object space to image plane then proceeds as 
follows: 


\[
X = Z_0\,\frac{x}{z} \tag{13}
\]

\[
Y = Z_0\,\frac{y}{z}, \tag{14}
\]


where $Z_0$ is the distance from the origin of the object space to the
origin of the image plane (see Fig. 1). Now for small $\Delta t$, (10) becomes


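\[
x' = x + \omega_z y\,\Delta t - \omega_y z\,\Delta t + v_x\,\Delta t \tag{15}
\]

\[
y' = y - \omega_z x\,\Delta t + \omega_x z\,\Delta t + v_y\,\Delta t \tag{16}
\]

\[
z' = z + \omega_y x\,\Delta t - \omega_x y\,\Delta t + v_z\,\Delta t \tag{17}
\]

and

\[
\frac{X'}{Z_0} = \frac{x}{z} + \left(\omega_z\frac{y}{z} - \omega_y + \frac{v_x}{z}\right)\Delta t
 + \frac{x}{z}\left(-\omega_y\frac{x}{z} + \omega_x\frac{y}{z} - \frac{v_z}{z}\right)\Delta t. \tag{18}
\]

Similarly,

\[
\frac{Y'}{Z_0} = \frac{y}{z} + \left(-\omega_z\frac{x}{z} + \omega_x + \frac{v_y}{z}\right)\Delta t
 + \frac{y}{z}\left(-\omega_y\frac{x}{z} + \omega_x\frac{y}{z} - \frac{v_z}{z}\right)\Delta t. \tag{19}
\]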


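By letting $Z_0 = 1$ and using the definitions

\[
X = \frac{x}{z}, \quad Y = \frac{y}{z}, \quad V_x = \frac{v_x}{z}, \quad \text{and} \quad V_y = \frac{v_y}{z}, \tag{20}
\]

we obtain

\[
X' = X + (\omega_z Y - \omega_y + V_x)\,\Delta t + \left(-X^2\omega_y + XY\omega_x - X\frac{v_z}{z}\right)\Delta t \tag{21}
\]

\[
Y' = Y + (-\omega_z X + \omega_x + V_y)\,\Delta t + \left(-XY\omega_y + Y^2\omega_x - Y\frac{v_z}{z}\right)\Delta t. \tag{22}
\]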


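Let $a = v_z/z$ be the magnification parameter; then the differential
movement of the coordinates in the image plane for infinitesimal
motion is as follows:

\[
\frac{dX}{dt} = \frac{X' - X}{\Delta t} = \omega_z Y - \omega_y(1 + X^2) + \omega_x XY + V_x - aX \tag{23}
\]

\[
\frac{dY}{dt} = \frac{Y' - Y}{\Delta t} = -\omega_z X + \omega_x(1 + Y^2) - \omega_y XY + V_y - aY. \tag{24}
\]

Thus there are six unknowns ($\omega_x$, $\omega_y$, $\omega_z$, $V_x$, $V_y$, $a$) that need to be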
evaluated to quantify motion. The only values that can be observed 
are the intensities of the image in the present and the previous frames. 
Several techniques can be formulated to estimate these parameters. 
In the following, two techniques are described in detail. 
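As a check on the derivation of eqs. (23) and (24), the sketch below moves a single object point by the incremental motion of eqs. (15) through (17), projects it with eqs. (13) and (14), and compares the resulting finite-difference image velocity with the closed-form expressions. It assumes $Z_0 = 1$; the function names and the numerical values used in the demonstration are illustrative assumptions, not quantities from the paper.

```python
def project(x, y, z, z0=1.0):
    """Perspective projection onto the image plane, eqs. (13)-(14)."""
    return z0 * x / z, z0 * y / z

def move_point(p, omega, v, dt):
    """Incremental rigid-body motion of eqs. (15)-(17)."""
    x, y, z = p
    wx, wy, wz = omega
    vx, vy, vz = v
    return (x + (wz * y - wy * z + vx) * dt,
            y + (-wz * x + wx * z + vy) * dt,
            z + (wy * x - wx * y + vz) * dt)

def image_velocity(X, Y, omega, Vx, Vy, a):
    """Right-hand sides of eqs. (23)-(24), valid for Z0 = 1."""
    wx, wy, wz = omega
    dX = wz * Y - wy * (1 + X * X) + wx * X * Y + Vx - a * X
    dY = -wz * X + wx * (1 + Y * Y) - wy * X * Y + Vy - a * Y
    return dX, dY

if __name__ == "__main__":
    p = (0.3, -0.2, 2.0)            # hypothetical object point (x, y, z)
    omega = (0.01, -0.02, 0.03)     # hypothetical (w_x, w_y, w_z)
    v = (0.1, 0.05, -0.2)           # hypothetical (v_x, v_y, v_z)
    dt = 1e-6
    X, Y = project(*p)
    X2, Y2 = project(*move_point(p, omega, v, dt))
    z = p[2]
    dX, dY = image_velocity(X, Y, omega, v[0] / z, v[1] / z, v[2] / z)
    print("finite difference:", (X2 - X) / dt, (Y2 - Y) / dt)
    print("eqs. (23)-(24):   ", dX, dY)
```

The two pairs of numbers agree up to terms of order $\Delta t$, which is what the first-order expansion leading to eqs. (21) and (22) predicts.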


III. MOTION ESTIMATION


The first technique deals with situations where the intensity changes 
only slightly as a result of small changes in motion parameters, 
whereas the second technique does successive linearization and there- 
fore can handle larger changes in intensity. Let I(X, Y, t) be the 
intensity function at time t. Then differential changes in intensity are
expressed by


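\[
I(X, Y, t + \Delta t) = I(X + \Delta_x\Delta t,\; Y + \Delta_y\Delta t,\; t), \tag{25}
\]

where

\[
\Delta_x = \omega_z Y - \omega_y(1 + X^2) + \omega_x XY - aX + V_x \tag{26}
\]

\[
\Delta_y = -\omega_z X + \omega_x(1 + Y^2) - \omega_y XY - aY + V_y. \tag{27}
\]

Expanding the intensity function in power series in $\Delta t$ yields

\[
I(X, Y, t + \Delta t) = I(X, Y, t) + \frac{\partial}{\partial X} I(X, Y, t)\cdot[\Delta_x]\,\Delta t
 + \frac{\partial}{\partial Y} I(X, Y, t)\cdot[\Delta_y]\,\Delta t. \tag{28}
\]

Thus,

\[
\begin{aligned}
\frac{I(X, Y, t + \Delta t) - I(X, Y, t)}{\Delta t}
 ={}& \omega_z\,[I_x(X, Y, t)\,Y - I_y(X, Y, t)\,X] \\
 & - \omega_y\,[I_x(X, Y, t)(1 + X^2) + I_y(X, Y, t)\,XY] \\
 & + \omega_x\,[I_x(X, Y, t)\,XY + I_y(X, Y, t)(1 + Y^2)] \\
 & - a\,[I_x(X, Y, t)\,X + I_y(X, Y, t)\,Y] \\
 & + V_x\,[I_x(X, Y, t)] + V_y\,[I_y(X, Y, t)].
\end{aligned} \tag{29}
\]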


Let $(X_i, Y_i)$ be a pel deemed to be from the set of “moving-area” pels,
i.e., the frame difference at these locations is above a certain prespecified
threshold. Then, for each such moving-area pel define the
following six-dimensional vector

\[
\phi_i = \begin{bmatrix}
I_x(X_i, Y_i, t)\,Y_i - I_y(X_i, Y_i, t)\,X_i \\
-I_x(X_i, Y_i, t)(1 + X_i^2) - I_y(X_i, Y_i, t)\,X_i Y_i \\
I_x(X_i, Y_i, t)\,X_i Y_i + I_y(X_i, Y_i, t)(1 + Y_i^2) \\
-I_x(X_i, Y_i, t)\,X_i - I_y(X_i, Y_i, t)\,Y_i \\
I_x(X_i, Y_i, t) \\
I_y(X_i, Y_i, t)
\end{bmatrix}. \tag{30}
\]

If

\[
C = \mathrm{col.}\,(\omega_z, \omega_y, \omega_x, a, V_x, V_y)
\]

denotes the six-dimensional parameter vector that needs to be estimated,
then we can express the measured intensity difference, $M_i$,

\[
M_i = \phi_i^T C + \text{noise}. \tag{31}
\]
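The regressor of eq. (30) is simply a regrouping of $I_x\Delta_x + I_y\Delta_y$ by parameter. The sketch below makes that explicit for one pel and also shows one possible way to select moving-area pels. The function names, the nested-list frame representation, and the way the threshold is applied are assumptions made for illustration; the text does not prescribe them.

```python
def phi_vector(X, Y, Ix, Iy):
    """Six-dimensional vector of eq. (30) for one pel with gradients (Ix, Iy)."""
    return [Ix * Y - Iy * X,
            -Ix * (1 + X * X) - Iy * X * Y,
            Ix * X * Y + Iy * (1 + Y * Y),
            -Ix * X - Iy * Y,
            Ix,
            Iy]

def displacement(X, Y, C):
    """(Delta_x, Delta_y) of eqs. (26)-(27); C = (w_z, w_y, w_x, a, V_x, V_y)."""
    wz, wy, wx, a, Vx, Vy = C
    dx = wz * Y - wy * (1 + X * X) + wx * X * Y - a * X + Vx
    dy = -wz * X + wx * (1 + Y * Y) - wy * X * Y - a * Y + Vy
    return dx, dy

def predicted_difference(phi, C):
    """Noise-free measurement of eq. (31): the inner product phi^T C."""
    return sum(p * c for p, c in zip(phi, C))

def moving_area(prev, curr, threshold):
    """Pels whose absolute frame difference exceeds the threshold.

    prev and curr are frames stored as nested lists (an assumed layout);
    the result is a list of (row, column) indices.
    """
    return [(r, c)
            for r, row in enumerate(curr)
            for c, value in enumerate(row)
            if abs(value - prev[r][c]) > threshold]

if __name__ == "__main__":
    C = [0.03, -0.02, 0.01, 0.05, 0.1, -0.2]   # hypothetical parameters
    X, Y, Ix, Iy = 0.2, -0.1, 3.0, -2.0        # hypothetical pel and gradients
    dx, dy = displacement(X, Y, C)
    print(predicted_difference(phi_vector(X, Y, Ix, Iy), C), Ix * dx + Iy * dy)
```

Both printed values are the same quantity computed two ways, which confirms that the ordering of the entries in eq. (30) matches the ordering of the parameters in C.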


If the number of measurements is n, then the problem is to create a
least-squares estimate of C (labeled $\hat{C}_n$) that minimizes the following
Mean-Squared Error (MSE) after these n measurements:

\[
\mathrm{MSE} = \min_{\hat{C}_n} \sum_{i=1}^{n} \left(M_i - \phi_i^T \hat{C}_n\right)^2. \tag{32}
\]

Carrying out the minimization, we get the set of equations

\[
\sum_{i=1}^{n} \phi_i M_i = \left[\sum_{i=1}^{n} \phi_i \phi_i^T\right] \hat{C}_n. \tag{33}
\]


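Thus, calculation of $\hat{C}_n$ requires a matrix inversion at every step. The
inversion can be carried out recursively as follows.

Let

\[
A_n = \sum_{i=1}^{n} \phi_i \phi_i^T \tag{34}
\]

and

\[
Q_n = \sum_{i=1}^{n} \phi_i M_i. \tag{35}
\]

Clearly,

\[
Q_n = Q_{n-1} + \phi_n M_n, \tag{36}
\]

and

\[
A_n = A_{n-1} + \phi_n \phi_n^T. \tag{37}
\]

From the matrix inversion lemma in Ref. 9, we obtain

\[
A_n^{-1} = A_{n-1}^{-1} - \frac{A_{n-1}^{-1} \phi_n \phi_n^T A_{n-1}^{-1}}{1 + \phi_n^T A_{n-1}^{-1} \phi_n}, \tag{38}
\]

and when this is used in (33), we get the recursion

\[
\hat{C}_n = \hat{C}_{n-1} - \frac{A_{n-1}^{-1} \phi_n}{1 + \phi_n^T A_{n-1}^{-1} \phi_n}
 \left(\phi_n^T \hat{C}_{n-1} - M_n\right). \tag{39}
\]

If

\[
\frac{A_{n-1}^{-1}}{1 + \phi_n^T A_{n-1}^{-1} \phi_n} = \text{constant}, \tag{40}
\]

then the above reduces to a simple gradient algorithm.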
If the motion was purely translational in the image plane and there
was no zooming (i.e., $v_z = 0$), then

\[
\omega_x = \omega_y = \omega_z = a = 0. \tag{41}
\]

The estimation of motion parameters would then be analogous to our
previous schemes. Matrix $A_n$ would be two-by-two, and the vectors
such as $\phi_n$, $\hat{C}_n$ would be two-dimensional.
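The nonrecursive solution of eq. (33) and the recursion of eqs. (38) and (39) can be compared on synthetic data. The sketch below is one possible arrangement, under several assumptions that are not in the text: the vectors $\phi_i$ are drawn at random rather than computed from image gradients, the measurements are noise-free, and the recursion is started from $\hat{C}_0 = 0$ with $A_0^{-1}$ set to a large multiple of the identity (a common way to start such recursions, which makes the recursive answer a very slightly regularized version of the batch answer).

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def batch_estimate(phis, Ms):
    """Normal equations of eq. (33): [sum phi phi^T] C = sum phi M."""
    k = len(phis[0])
    A = [[sum(p[i] * p[j] for p in phis) for j in range(k)] for i in range(k)]
    Q = [sum(p[i] * m for p, m in zip(phis, Ms)) for i in range(k)]
    return solve(A, Q)

def rls_update(C, P, phi, M):
    """One step of eqs. (38)-(39); P holds the current inverse A_{n-1}^{-1}."""
    k = len(C)
    Pphi = [sum(P[i][j] * phi[j] for j in range(k)) for i in range(k)]
    denom = 1.0 + sum(phi[i] * Pphi[i] for i in range(k))
    err = sum(phi[i] * C[i] for i in range(k)) - M
    C_new = [C[i] - Pphi[i] * err / denom for i in range(k)]
    # P is symmetric, so phi^T P is the transpose of P phi.
    P_new = [[P[i][j] - Pphi[i] * Pphi[j] / denom for j in range(k)]
             for i in range(k)]
    return C_new, P_new

def demo(n=40, seed=0):
    """Return (true, batch, recursive) parameter vectors on synthetic data."""
    rng = random.Random(seed)
    C_true = [0.03, -0.02, 0.01, 0.05, 0.4, -0.3]      # hypothetical values
    phis = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(n)]
    Ms = [sum(p * c for p, c in zip(phi, C_true)) for phi in phis]
    C_batch = batch_estimate(phis, Ms)
    C = [0.0] * 6
    P = [[1e6 if i == j else 0.0 for j in range(6)] for i in range(6)]
    for phi, M in zip(phis, Ms):
        C, P = rls_update(C, P, phi, M)
    return C_true, C_batch, C

if __name__ == "__main__":
    for name, vec in zip(("true", "batch", "recursive"), demo()):
        print(name, [round(x, 6) for x in vec])
```

The recursive form never inverts a matrix explicitly; each new pel costs one matrix-vector product and one rank-one update, which is the practical advantage of eq. (39) over re-solving eq. (33).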
3.1 Successive linearization estimation


In the previous section we derived an iterative procedure that did 
not use the previous estimates in the linearization process. Improve- 
ment may be obtained if the intensity function is linearized at different 
locations in the previous frame based on the value of the previous 
estimates. Thus, as before, let $\hat{C}_{n-1}$ be the estimate of six parameters
of motion made after observing a patch of (n — 1) pels. We wish to
revise this estimate to obtain $\hat{C}_n$, which includes a patch of n pels
obtained by adding a new pel to the previous patch of (n — 1) pels. 
Define 


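\[
\Delta_x^{n-1} = \hat{\omega}_z^{n-1} Y - \hat{\omega}_y^{n-1}(1 + X^2) + \hat{\omega}_x^{n-1} XY - \hat{a}^{n-1} X + \hat{V}_x^{n-1} \tag{42}
\]

\[
\Delta_y^{n-1} = -\hat{\omega}_z^{n-1} X + \hat{\omega}_x^{n-1}(1 + Y^2) - \hat{\omega}_y^{n-1} XY - \hat{a}^{n-1} Y + \hat{V}_y^{n-1}. \tag{43}
\]

Also, define a new cost function DFD(·) to be*

\[
\mathrm{DFD}(X, Y, \Delta_x^{n-1}, \Delta_y^{n-1}, t)
 = I(X, Y, t + \Delta t) - I(X + \Delta_x^{n-1}\Delta t,\; Y + \Delta_y^{n-1}\Delta t,\; t). \tag{44}
\]

We note that, as defined above, DFD = 0 if

\[
\begin{aligned}
\Delta_x^{n-1} &= \Delta_x, \quad \text{for any } X, Y \\
\Delta_y^{n-1} &= \Delta_y, \quad \text{for any } X, Y,
\end{aligned} \tag{45}
\]

i.e., when our estimates of motion are equal to the true value of the
parameters of motion. Also, for any given estimate of the motion
parameters, DFD can be calculated if we know the intensities of two
successive frames. As before, we can now expand DFD in Taylor’s
series. Thus

\[
\begin{aligned}
\mathrm{DFD}(X, Y, \Delta_x^{n-1}, \Delta_y^{n-1}, t)
 ={}& I(X, Y, t + \Delta t) - I(X + \Delta_x^{n-1}\Delta t,\; Y + \Delta_y^{n-1}\Delta t,\; t) \\
 ={}& I(X + \Delta_x\Delta t,\; Y + \Delta_y\Delta t,\; t) - I(X + \Delta_x^{n-1}\Delta t,\; Y + \Delta_y^{n-1}\Delta t,\; t) \\
 ={}& I\bigl(X + \Delta_x^{n-1}\Delta t + (\Delta_x - \Delta_x^{n-1})\Delta t,\; Y + \Delta_y^{n-1}\Delta t + (\Delta_y - \Delta_y^{n-1})\Delta t,\; t\bigr) \\
 & - I(X + \Delta_x^{n-1}\Delta t,\; Y + \Delta_y^{n-1}\Delta t,\; t).
\end{aligned} \tag{46}
\]

Let

\[
I = I(X + \Delta_x^{n-1}\Delta t,\; Y + \Delta_y^{n-1}\Delta t,\; t). \tag{47}
\]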


* As in Ref. 2, DFD stands for displaced frame difference. 
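Then, with $I_x$ and $I_y$ denoting the partial derivatives of the displaced
intensity I of (47), the first-order expansion gives

\[
\begin{aligned}
\frac{\mathrm{DFD}(X, Y, \Delta_x^{n-1}, \Delta_y^{n-1}, t)}{\Delta t}
 ={}& (\omega_z - \hat{\omega}_z^{n-1})\,[I_x Y - I_y X] \\
 & - (\omega_y - \hat{\omega}_y^{n-1})\,[I_x(1 + X^2) + I_y XY] \\
 & + (\omega_x - \hat{\omega}_x^{n-1})\,[I_x XY + I_y(1 + Y^2)] \\
 & + (\hat{a}^{n-1} - a)\,[I_x X + I_y Y] \\
 & + (V_x - \hat{V}_x^{n-1})\,[I_x] \\
 & + (V_y - \hat{V}_y^{n-1})\,[I_y].
\end{aligned} \tag{48}
\]

Once again, define a vector of measured quantities

\[
\phi_i^{n-1} = \begin{bmatrix}
I_x Y_i - I_y X_i \\
-I_x(1 + X_i^2) - I_y X_i Y_i \\
I_x X_i Y_i + I_y(1 + Y_i^2) \\
-I_x X_i - I_y Y_i \\
I_x \\
I_y
\end{bmatrix}. \tag{49}
\]

Then

\[
M_i^{n-1} = \mathrm{DFD}(X_i, Y_i, \Delta_x^{n-1}, \Delta_y^{n-1}, t)/\Delta t
 = (\phi_i^{n-1})^T (C - \hat{C}_{n-1}) + \text{noise}. \tag{50}
\]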


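The least-squares estimate, $\hat{C}_n$, should minimize the following mean-
squared error,

\[
\mathrm{MSE} = \min_{\hat{C}_n} \sum_{i=1}^{n} \left[M_i^{n-1} - (\phi_i^{n-1})^T (\hat{C}_n - \hat{C}_{n-1})\right]^2 \tag{51}
\]

for any given initial estimate $\hat{C}_{n-1}$. As before, carrying out the
minimization, we get

\[
\sum_{i=1}^{n} \phi_i^{n-1} M_i^{n-1} = \left[\sum_{i=1}^{n} \phi_i^{n-1} (\phi_i^{n-1})^T\right] (\hat{C}_n - \hat{C}_{n-1}). \tag{52}
\]

Then

\[
\hat{C}_n = \hat{C}_{n-1} + \left[\sum_{i=1}^{n} \phi_i^{n-1} (\phi_i^{n-1})^T\right]^{-1}
 \left[\sum_{i=1}^{n} \phi_i^{n-1} M_i^{n-1}\right]. \tag{53}
\]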


As in the previous section, the matrix inversion lemma can be used to
invert the matrix

\[
\left[\sum_{i=1}^{n} \phi_i^{n-1} (\phi_i^{n-1})^T\right].
\]


The real difference between this method and that of the previous 
section is that even if the motion is large (i.e., parameters of motion 
are somewhat large), the successive linearization, if it converges, gives 
more accurate estimates, since $(C - \hat{C}_{n-1})$ becomes smaller as iterations
proceed.
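The effect of relinearizing at the displaced location can be illustrated on the reduced translational case of eq. (41), where only $V_x$ and $V_y$ are unknown. The sketch below is a simplified illustration rather than the procedure of eq. (53) as stated: it uses a synthetic analytic intensity function, sets $\Delta t = 1$, and repeats the update over one fixed patch of pels instead of adding one pel per step. The intensity function, the patch, and the true motion are all hypothetical choices.

```python
import math

TRUE_MOTION = (1.5, -1.0)   # hypothetical (V_x, V_y), used only to build the frames

def previous_frame(X, Y):
    """Synthetic intensity I(X, Y, t)."""
    return math.sin(0.3 * X) + math.cos(0.2 * Y)

def previous_gradient(X, Y):
    """Analytic partial derivatives (I_x, I_y) of previous_frame."""
    return 0.3 * math.cos(0.3 * X), -0.2 * math.sin(0.2 * Y)

def current_frame(X, Y):
    """I(X, Y, t + dt) from eq. (25) with dt = 1 and pure translation."""
    return previous_frame(X + TRUE_MOTION[0], Y + TRUE_MOTION[1])

def refine(estimate, pels):
    """One least-squares correction in the spirit of eq. (53), two parameters."""
    vx_hat, vy_hat = estimate
    a11 = a12 = a22 = b1 = b2 = 0.0
    for X, Y in pels:
        # Displaced frame difference of eq. (44) for the current estimate.
        dfd = current_frame(X, Y) - previous_frame(X + vx_hat, Y + vy_hat)
        # Gradients evaluated at the displaced location, as in eq. (47).
        gx, gy = previous_gradient(X + vx_hat, Y + vy_hat)
        a11 += gx * gx
        a12 += gx * gy
        a22 += gy * gy
        b1 += gx * dfd
        b2 += gy * dfd
    det = a11 * a22 - a12 * a12
    return (vx_hat + (a22 * b1 - a12 * b2) / det,
            vy_hat + (a11 * b2 - a12 * b1) / det)

def estimate_motion(pels, passes):
    """Start from zero motion and return the estimate after each pass."""
    estimate = (0.0, 0.0)
    history = []
    for _ in range(passes):
        estimate = refine(estimate, pels)
        history.append(estimate)
    return history

PELS = [(x, y) for x in range(-5, 6) for y in range(-5, 6)]

if __name__ == "__main__":
    for k, est in enumerate(estimate_motion(PELS, 5), start=1):
        print(f"pass {k}: V_x = {est[0]:.6f}, V_y = {est[1]:.6f}")
```

The first pass corresponds to a single linearization about zero motion and leaves a visible error; later passes shrink the difference between the true and estimated parameters, which is the behavior the paragraph above describes.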


IV. CONCLUSIONS 


A mathematical theory that provides algorithms for robust estima- 
tion of a general set of motion parameters from frame sequences 
obtained from a television camera is now available. The theory was 
derived under mild assumptions. Both recursive and nonrecursive 
algorithms are provided. The efficacy of our algorithms has to be 
evaluated for each application. Potential applications are for inter- 
frame coding, computer vision, and computer animation. 
Homenet: A Broadband Voice/Data/Video
Network on CATV Systems 


By M. HATAMIAN and E. G. BOWEN* 
(Manuscript received April 3, 1984) 


Homenet is a broadband distributed communication system that supports 
data, real-time digitized voice, and analog video on a single cable in a CATV 
type of network. The distance limitation problem encountered in local area 
networking schemes is eliminated by dividing the large CATV net into smaller 
“homenets.” This feature makes the network suitable for a large number of 
users located in a relatively wide geographic scope. This paper describes the 
implementation of a small experimental version of this system in hardware. 
More attention is given to the protocol processing hardware, which implements 
a protocol based on collision detection called Movable Slot Time Division 
Multiplexing (MSTDM). The MSTDM protocol guarantees the continuity of 
voice signals received at the user station. Problems such as clock synchroni- 
zation and confusion of data and voice packets are addressed, and solutions 
are given. Presently, an experimental network composed of five user nodes in 
two different frequency nets is operational. An interactive video retrieval 
service implemented in the network is described as an example of the type of 
user services (other than data/voice/one-way video) that can be offered at the 
main head end of the system. 


I. INTRODUCTION


The concept of Local Area Networks (LANs) is a well-developed 
one in the field of computing and data communication.’ These net- 
works are used for sharing computing resources, and for communicating
data among a number of users in a limited geographic scope.
Adding voice and video capability to these networks makes them very
attractive for the office information systems of the future. Furthermore,
if the distance limitation and the constraint of limited geographic
scope are removed, then such networks become potential
candidates for the home information systems of the future, provided
that the cost of user’s terminal equipment is minimal. Such networks
should more appropriately be called Metropolitan Area Networks
(MANs) rather than LANs. Solving the distance limitation problem
can also increase the attractiveness of the office information systems;
office branches located at distant locations can become part of the
network and can share information.


* Authors are employees of AT&T Bell Laboratories.


Copyright © 1985 AT&T. Photo reproduction for noncommercial use is permitted without
payment of royalty provided that each reproduction is done without alteration and
that the Journal reference and copyright notice are included on the first page. The title
and abstract, but no other portions, of this paper may be copied or distributed royalty
free by computer-based and other information-service systems without further permission.
Permission to reproduce or republish any other portion of this paper must be
obtained from the Editor.

Homenet is a broadband data/voice/video communication system, 
first proposed in Ref. 1, which satisfies all the above requirements for 
MANs. The system combines frequency and time multiplexing, and
supports the communication of data, real-time digitized voice, and 
analog TV signals on a single cable in a cable TV (CATV) type of 
network. This paper describes the hardware implementation of the 
homenet and some of its features. Presently, a fully working testbed 
composed of five user stations connected in two nets is in operation. 

Sections II and III describe the homenet and the communication 
protocol. Section IV gives a detailed description of the hardware 
implementation of the system and its features. More attention is given 
to the protocol processing hardware, which is essentially the intelligent 
node of a distributed packet switching network. 


II. WHAT IS A HOMENET?


Homenet is a broadband communication system based on a combi- 
nation of frequency and time multiplexing, and distributed packet 
switching techniques. The system supports the communication of data, 
digitized voice, and one-way analog TV signals on a cable in a CATV
type of network. Since it is basically a distributed switching network, 
all the switching functions are performed at the user’s terminal equip- 
ment and there is no central switching involved. Reference 2 describes 
the system in detail. 

A relatively large community of users is divided into small geo- 
graphic regions and each region, called a homenet (or a net for short), 
is assigned a 6-MHz frequency band. Users within each frequency 
band can receive signals from any other net by tuning their receivers 
to the frequency of that net. This tuning is performed automatically 
by a signaling scheme. All users transmit their data and digitized voice 
signals on a single transmit frequency, $F_0$, which is then translated to
its homenet frequency and propagated throughout the network in such
a way that all the users in all nets are able to receive it. Before data
are transmitted, users in each frequency band have to contend for the
channel using the protocol described in the next section of this paper.

The overall operation of the homenet is shown in Fig. 1. Suppose
that user No. N in net 3 wants to transmit a packet to user No. 1 in
net 2. First, user No. N contends for the channel in net 3 and once
access to the channel is gained, it transmits its data on frequency $F_0$
to the nominal head end of net 3 ($H_3$). At this point, frequency $F_0$ is
translated to two different frequency bands, $F_3$ and $F_{r3}$. Frequency $F_3$
is transmitted downstream to the users in net 3 and all the nets
following net 3 (in this example there are no nets following net 3).
Frequency $F_{r3}$ is transmitted upstream to the main head end of the
network ($H_1$) at net 1, where it is translated back to $F_3$ and sent to
the users in net 1 and net 2. Now user No. 1 in net 2, with its tuner
listening to frequency $F_3$, can receive the packet by demodulating the
signal from $F_3$ to baseband and searching for its address in the address
area of the packet. Other than the frequency translation operations,
each nominal head end $H_i$ is equipped with two notch filters: one for
frequency band $F_i$, which stops the signal translated from $F_{ri}$ to $F_i$ at
the head end; and one for $F_0$, which allows all the nets to use the same
transmit frequency $F_0$ without interfering with their adjacent nets.
The above scheme is certainly not the only possible way of using
the frequency bands;* however, it is the one that requires the least
amount of hardware for each user station and also is compatible with 
currently installed midsplit CATV networks. 
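The routing in the example above can be summarized in a few lines. The sketch below is only an illustration of the description in the text, under the assumption that nets are numbered 1 to n along the cable starting at the main head end; the function name and the path labels are hypothetical.

```python
def delivery_paths(src_net, n_nets):
    """Map each net to the path by which it hears a packet sent from src_net.

    Every receiver listens on the band of the source net. Nets at or after
    the source net get that band directly downstream from the nominal head
    end of the source net. Nets before it get the band after the upstream
    copy has been translated back at the main head end; the notch filter
    at the source net's head end stops that copy from travelling further.
    """
    paths = {}
    for net in range(1, n_nets + 1):
        if net >= src_net:
            paths[net] = "downstream"
        else:
            paths[net] = "upstream"
    return paths

if __name__ == "__main__":
    # The example in the text: a user in net 3 of a three-net system transmits.
    print(delivery_paths(3, 3))
```

For the example in the text, nets 1 and 2 receive the packet by way of the main head end and net 3 receives it directly, so every net hears every transmission while contention stays local to the source net.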

Establishing a communication network for a community of users as
large as that accommodated by homenet would not
be possible by direct extension of LAN techniques (i.e., increasing the
cable length and the bit rate). One of the major limitations of the local 
area networks is the cable length constraint, which forces the users to 
be located close to each other to prevent large propagation delays. 
Ethernet,* one of the best-known local area networks,’ is limited to a 
cable length of about 2.5 kilometers operating at 10 Mb/s; increasing 
the bit rate or the cable length results in an appreciable reduction in 
the efficiency of the system. In homenet this distance limitation 
problem is solved by using different frequency bands and grouping the 
users that are located close together into one frequency net. This way, 
the users in each net have to contend for transmission rights only 
among themselves, and do not have to worry about transmitters in 
other frequency nets; they can still listen to all other frequency nets, 
so a complete connection between all users in all nets exists. This 
broadband technique increases the size of the network (i.e., the length 
of the cable) and at the same time reduces propagation delays, which 
are very important, especially for access strategies that use collision 
detection schemes such as CSMA/CD. Each net can operate at low 
bit rates but the total throughput of the network can go up to several 
hundred Mb/s, depending on the number of frequency bands used in 
the system. 
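The distance limitation can be made concrete by expressing the propagation delay in bit times. The sketch below does this arithmetic for the Ethernet figures quoted above. The propagation speed of 2 × 10^8 m/s and the 1000-bit packet length are assumed values chosen for illustration, not figures from the text.

```python
def propagation_delay_bits(cable_m, bit_rate_bps, speed_mps=2.0e8):
    """One-way propagation delay expressed in bit times.

    speed_mps is an assumed signal speed in the cable.
    """
    return cable_m / speed_mps * bit_rate_bps

def normalized_delay(cable_m, bit_rate_bps, packet_bits, speed_mps=2.0e8):
    """Ratio of propagation delay to packet transmission time.

    Collision-detection schemes lose efficiency as this ratio grows.
    """
    return propagation_delay_bits(cable_m, bit_rate_bps, speed_mps) / packet_bits

if __name__ == "__main__":
    # 2.5 km at 10 Mb/s, as quoted for Ethernet in the text.
    print(propagation_delay_bits(2500, 10e6))
    # The same bit rate over a cable ten times longer, with 1000-bit packets.
    print(normalized_delay(2500, 10e6, 1000), normalized_delay(25000, 10e6, 1000))
```

Because the ratio scales with both cable length and bit rate, keeping each net geographically small and running it at a modest bit rate keeps the ratio low, while the number of frequency bands sets the aggregate throughput.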


III. COMMUNICATION PROTOCOL


The communication protocol used in each net of the homenet system 
is called Movable Slot Time Division Multiplexing (MSTDM)*—a 
variation of the CSMA/CD technique used in Ethernet. This protocol 
guarantees the continuity of the voice signals received at each user 
station, a task that no other currently available protocol based on 
collision detection can handle. This protocol is described in the follow- 
ing section. 

Integration of packetized data and voice in a local area network 
requires an upper limit on the voice packet delays to ensure that the 
voice receiver does not run out of samples before new voice samples 
arrive. This requirement in turn guarantees a glitch-free, continuous 
speech signal at the output of the voice receiver. None of the currently 
available protocols satisfy the above requirement and hence are not 
suitable for integration of data and digitized voice. The MSTDM protocol
places an upper bound on the voice packet delays; it guarantees the 
continuity of the reconstructed voice signal and also guarantees that 
once access to the channel is gained, no two voice sources can collide. 
MSTDM takes advantage of the periodicity of the voice packets; it 
also requires that the size of the data packets be smaller than the voice 
packets. A detailed treatment of this protocol can be found in Ref. 6. 
A description of its operation is given below. 

In MSTDM a distinction is drawn between the first packet from a 
voice source and all the following voice packets from that source 
(called the secondary voice packets in this paper). The first voice 
packet and the data packets are treated the same way as in CSMA/ 
CD. They check the channel busy signal to monitor the status of the 
channel, and once the channel is idle, they start transmitting. They 
listen to the channel while transmitting to make sure that no
collision occurs, and if there is a collision, then the colliding sources 
stop their transmission and try to access the channel after a period of 
time defined by a retry strategy. This procedure stays the same for all 
data packets; however, for the voice sources the procedure is different. 
Once the first packet of voice successfully acquires the channel, then 
the following packets from that voice source get transmitted (when 
they are ready for transmission) as soon as the channel becomes idle; 
they do not listen to the channel for collision during transmission. If 
a collision occurs between these secondary voice packets and any other 
packet, the other transmitter is forced to stop its transmission and 
the secondary voice packets override. 

Figures 2a and b show the voice and data packet formats, respec- 
tively. When there is a collision between a secondary voice packet and 
a data packet, the preempt portion of the voice packet, which does not 
contain any information, allows enough time for the data source to 
detect collision and stop its transmitter before the sync bits from the 
secondary voice packet appear on the channel. 
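The role of the preempt header can be seen by laying out a voice packet as a bit string. The sketch below follows the field order of the voice packet in Fig. 2a; the field widths, the use of zeros for the preempt bits, and the class and method names are all assumptions made for illustration, since the text leaves these to be set per net.

```python
from dataclasses import dataclass

MAX_PACKET_BITS = 4096   # the text allows any packet length below this

@dataclass
class VoicePacket:
    """Voice packet fields in the order shown in Fig. 2a (widths are hypothetical)."""
    preempt_len: int     # number of leading bits that carry no information
    sync: str
    destination: str
    source: str
    voice_bits: str
    overflow: str

    def to_bits(self) -> str:
        """Serialize the packet; the preempt header is sent first."""
        bits = ("0" * self.preempt_len + self.sync + self.destination
                + self.source + self.voice_bits + self.overflow)
        if len(bits) >= MAX_PACKET_BITS:
            raise ValueError("packet must be shorter than 4096 bits")
        return bits

    def sync_offset(self) -> int:
        """Bit position at which the first information-bearing bit appears."""
        return self.preempt_len

if __name__ == "__main__":
    pkt = VoicePacket(preempt_len=8, sync="1010", destination="0001",
                      source="0010", voice_bits="11110000", overflow="00")
    print(pkt.to_bits(), pkt.sync_offset())
```

A colliding data source has until `sync_offset()` bit times to detect the collision and stop, so the preempt length has to cover the collision detection time of the slowest data transmitter in the net.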

(a)  PREEMPT HEADER | SYNC | DESTINATION ADDRESS | SOURCE ADDRESS | VOICE BITS | OVERFLOW AREA

(b)  DESTINATION | SOURCE

Fig. 2—Packet formats for (a) voice packet, and (b) data packet.

After a voice source transmits a packet, it schedules its next transmission
for T seconds later. Assuming the rather unlikely situation
where all the packets from a particular voice source find the channel 
idle when they are ready to be transmitted, then, for that particular 
voice source, the channel looks exactly like a TDM system with 
reserved time slots that are T seconds apart. However, it is quite likely 
that when the voice source is ready for its next transmission, a data 
source or another voice source is in the middle of transmitting its 
packet. In this case the voice transmitter has to wait for the channel 
to become idle. Obviously, in this situation the voice packet will be 
delayed and this delay causes the time slot for the voice source to 
move back in time. Therefore, for the voice sources the system looks 
like a TDM channel in which the time slots are not fixed in time and 
are free to move; hence the name movable slot TDM. 
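The way a delayed packet shifts all later slots can be traced with a short timeline computation. The sketch below is a simplified model, not the protocol hardware: the other traffic is given as a fixed list of busy intervals that is not rescheduled, time is in arbitrary integer units, and the function name is hypothetical.

```python
def mstdm_voice_schedule(first_start, period, n_packets, busy):
    """Trace when one voice source transmits under the movable-slot rule.

    busy is a list of (start, end) intervals, sorted by start time, during
    which some other source occupies the channel. Returns one tuple
    (ready_time, start_time, delay) per voice packet.
    """
    schedule = []
    ready = first_start
    for _ in range(n_packets):
        start = ready
        for s, e in busy:
            if s <= start < e:       # channel busy when the packet is ready
                start = e            # transmit as soon as the channel is idle
        schedule.append((ready, start, start - ready))
        ready = start + period       # next slot is T after the actual start
    return schedule

if __name__ == "__main__":
    # Period T = 10 units; other traffic holds the channel during [19, 22).
    for ready, start, delay in mstdm_voice_schedule(0, 10, 5, [(19, 22), (45, 50)]):
        print(f"ready {ready:3d}  start {start:3d}  delay {delay}")
```

In this trace the third packet is ready at time 20, waits until 22, and every later packet is then scheduled relative to 22 rather than 20, so the slot has moved while the spacing between voice packets stays at T.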

Reference 6 proves that if the voice packet delay is less than a 
packet transmission time (the upper bound on the voice packet delay 
in MSTDM to which we previously referred), then voice sources will 
never collide and no voice samples will be lost. To satisfy this require- 
ment data packets are constrained to be shorter than voice packets. 

The voice samples arriving during the voice packet delay time are 
stored in the overflow portion of the voice buffer (see Fig. 2a), and are 
transmitted along with the rest of the packet when the channel 
becomes available. The overflow area is always transmitted even if it 
does not contain any voice samples (i.e., the case when the voice 
packet is not delayed). This contributes to the proof of the fact that 
the voice sources never collide in MSTDM once they successfully 
transmit their first packet.⁶ Obviously the overflow area of the voice
buffer should be long enough to accommodate the voice samples that 
arrive during the voice packet delay, which is less than a packet 
transmission time. Since the transmission clock rate is much higher 
than the voice sampling rate, the size of the overflow area need not be 
larger than a few bits. 
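The size of the overflow area follows from the ratio of the two clock rates. The sketch below computes it under the bound stated above, namely that the delay is less than one packet transmission time. The packet length, channel rate, and voice rate used in the demonstration are hypothetical values, not parameters of the experimental system.

```python
import math

def overflow_bits(packet_bits, channel_bps, voice_bps):
    """Voice bits that can arrive while a voice packet waits for the channel.

    The worst-case wait is taken to be one packet transmission time.
    """
    max_delay_s = packet_bits / channel_bps
    return math.ceil(max_delay_s * voice_bps)

if __name__ == "__main__":
    # Hypothetical figures: 1024-bit packets, 10 Mb/s channel, 64 kb/s voice.
    print(overflow_bits(1024, 10e6, 64e3))
```

With these assumed figures the overflow area comes to 7 bits, which is in line with the statement that a few bits suffice when the transmission clock is much faster than the voice sampling clock.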


IV. HARDWARE IMPLEMENTATION 


A small version of the homenet network described in previous 
sections has been built in hardware for experimental purposes and
feasibility studies. We currently have a fully working testbed composed 
of five user stations running in two frequency nets. Except for the 
protocol processing hardware, which is the most important part of a 
user station, all the components used in the system (such as frequency 
translators, taps, splitters, channel selectors, cable, etc.) are similar to 
those used by the CATV industry. 
Fig. 3—Block diagram of user node hardware.


4.1 User node 


Each user in homenet requires some hardware to interface his or 
her equipment to the communication cable. The user-node hardware 
can be thought of as a black box with one end connected to a cable 
and the other end to the user’s voice source (normally a phone set), 
data source, and TV set. Figure 3 shows various components of the
user-node hardware. The signal picked up from the cable through the 
tap is split and distributed in three ways. One line is connected directly 
to a TV set, another line feeds a collision demodulator whose function 
will be described later in this paper, and the third line is connected to 
a channel selector. A signaling scheme controls the channel selector 
and sets it to the frequency of the net that the user wants to hear. 
This frequency band is converted to a common Intermediate Frequency
(IF) band $F_0$ and fed to a demodulator, which translates the
information contained in band $F_0$ to a digital bit stream. This bit
stream is then processed by the protocol processor and, if the information
is destined for the user’s address, it will be appropriately sent
to either the data or the voice section. For the purpose of transmission,
once the channel is accessed, the packets prepared by the protocol
processor are modulated to the frequency band $F_0$ using a modulator
and sent to the user’s nominal head end (through the tap and over the
cable) to be distributed throughout the network, as we described in
Section II. 

Except for the protocol processor, which is special-purpose hard- 
ware, the rest of the user node components are commercially available 
items. The modulators and demodulators are tuned to a center frequency
of 43.4 MHz ($F_0$ in the current system).


4.2 Protocol processor 


The protocol processor is essentially the intelligent part of the user 
node hardware. It is responsible for digitization and packetization of 
voice signals, packetization of data, and most importantly, implemen- 
tation of the MSTDM protocol. 

The processor is divided into two main sections, transmitter and 
receiver. Each section has two separate circuits, one for voice and 
another for data. Following is a detailed description of the operation 
of these circuits. 


4.2.1 Transmitter 


The voice and data sections of the protocol processor’s transmitter 
operate independently of each other, with their own dedicated buffers. 
In terms of the ordering of the sync and address fields, the packet 
formats are fixed and are as shown in Fig. 2. However, the position of 
the sync word, the number of sync words, the length of the voice
preempt header, and the length of the packet can be arbitrarily set by 
the user to conform to a net standard. The packet length can be set 
to any number of bits fewer than 4096. 


4.2.1.1 Voice. Before we describe the operation of the voice transmit- 
ter, we should discuss the structure of the voice buffer. As we men- 
tioned before, the first voice packet is treated the same as data packets 
in terms of accessing the channel. The moment that the voice signal 
is activated, the voice transmitter starts filling the voice buffers and 
at the same time makes a request for transmission. When transmission 
right is granted, the number of collected voice samples may not be 
enough to fill the whole buffer, and as a result a number of empty 
locations (noise bits) will remain at the end of the buffer. Now, if the 
buffer is transmitted from beginning to end, then the empty area will 
cause a quiet interval (or a glitch) in the voice signal between the first 
and the second voice packets. In applications where a Time Assign- 
ment Speech Interpolation (TASI) mode of operation is desired, this 
effect can cause serious distortion in the reconstructed speech signal. 
However, if the empty portion of the first voice packet is transmitted 
before the actual voice bits, then the quiet interval will not be in the 
middle of the voice signal and will not create any difficulty. This 
procedure is illustrated in Fig. 4. The buffer is shown in this figure as
a long First-In First-Out (FIFO) shift register; a First Packet flag
signal (FP in Fig. 4) indicates whether the bits are read out from
the end of the FIFO register or from its head position (i.e., the first
voice bit). In a Very Large-Scale Integration (VLSI) design, the
implementation of such a buffer is rather simple considering the
regularity of its structure. In our Transistor-Transistor Logic (TTL)
design, we used RAMs as buffers, and counters and latch registers to
keep track of the first voice bit position, the last voice bit position,
and the length of the unused portion of the buffer.

Fig. 4. Voice buffer operation.
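
The read-out order selected by the First Packet flag can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the buffer is modeled as a list, the function name `read_out` and its parameters are hypothetical, and the hardware counters and latch registers are replaced by a single `filled` count.

```python
def read_out(buffer_bits, filled, first_packet):
    """Return the order in which a fixed-length voice buffer is sent.

    buffer_bits  -- contents of the whole buffer (list of symbols)
    filled       -- number of leading positions that hold real voice bits
    first_packet -- True when the First Packet (FP) flag is set
    """
    voice = buffer_bits[:filled]     # collected voice bits
    unused = buffer_bits[filled:]    # empty locations at the end of the buffer
    if first_packet:
        # FP set: send the unused portion first, so the quiet interval
        # precedes the speech instead of interrupting it.
        return unused + voice
    # Later packets: read from the beginning of the buffer to the end.
    return voice + unused
```

With `"v"` standing for a voice bit and `"-"` for an empty location, a half-filled first packet `vvv---` would be sent as `---vvv`.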

Figure 5 is a block diagram of the voice transmitter. To be able to 
handle the voice signal in real time, two buffers operating in ping- 
pong mode are needed; one buffer is being filled with voice bits while 
the other is being transmitted. The combination of RAM and counter
blocks in Fig. 5 represents a FIFO register which, in conjunction with
the first packet handler circuit, operates in the manner described above.
The input voice signal from the voice source is digitized and converted
to a serial bit stream by a 64-kb/s μ-law codec chip (8-kHz sampling
rate, 8 b/sample). A multiplexer at the input of the ping-pong buffer
controls the distribution of the input bits, clock signals (voice clock 
and transmission clock), and memory write pulses (R/W) to the 
buffers. The buffer that is supposed to be transmitted receives the 
transmission clock and no write pulses; the buffer that is being filled 
with the voice bits receives the voice clock (64 kHz), the voice bits, 
and the memory write pulses. 
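
As a rough software analogy of the ping-pong arrangement, the sketch below keeps two buffers and swaps their roles on command. The class and method names are hypothetical, and the clock and write-pulse multiplexing of the real circuit is reduced to a single index; it is not a description of the TTL hardware.

```python
class PingPongBuffer:
    """Two buffers: one is being filled while the other is sent."""

    def __init__(self):
        self.buffers = [[], []]
        self.fill = 0            # index of the buffer receiving voice bits

    def write(self, bit):
        # The buffer being filled gets the voice bits (and, in hardware,
        # the voice clock and the memory write pulses).
        self.buffers[self.fill].append(bit)

    def swap(self):
        """Switch the ping-pong mode and return the buffer to transmit."""
        ready = self.buffers[self.fill]
        self.fill ^= 1                   # the other buffer is filled next
        self.buffers[self.fill] = []     # start the new fill from empty
        return ready


# Codec rate quoted in the text: 8000 samples/s * 8 b/sample = 64 kb/s.
VOICE_BIT_RATE = 8000 * 8
```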

Before trying to establish a voice connection, the Timing and 
Control Circuit (TCC) first clears the buffers and then writes the 
header information (i.e., sync, destination and source addresses) into 
both ping and pong buffers through the input multiplexer. At the time 
the first packet of voice begins to be formed in the ping buffer, a 
request for transmission is made by TCC to the Transmit Request 
Circuit (XRC). Once XRC receives the request, it starts monitoring 
the channel busy signal provided by the demodulator, and the moment
the channel becomes idle XRC sends an acknowledgment to TCC. 
Upon receiving this acknowledgment, TCC switches the ping-pong 
mode (by changing the multiplexer control signals) and sets a trans- 
mission flip-flop. Now the pong buffer starts receiving the voice bits 
and the ping buffer, which contains the first packet of voice, begins to 
be transmitted on the cable under the control of the first packet 
handler circuit, through the output multiplexer and the Frequency 
Shift Keying (FSK) modulator. If no collision is reported by the 
Collision Detection Circuit (CDC), then the transmission continues 
until the last bit of the voice packet is sent, at which time TCC clears 
the transmit flip-flop and sends a signal to the Next Packet Scheduling 
(NPS) circuit. NPS schedules a transmission request for T seconds 
later. T is switch selectable and is set by the user, depending on the 
length of the voice packet. During these T seconds, voice samples are
being stored in the pong buffer, and the ping buffer, which has already
transmitted its data, is waiting to be switched. After T seconds, NPS
sends a transmission request to XRC and when transmission right is 
granted, TCC again sets the transmit flip-flop and switches the ping- 
pong mode. Now, the input voice samples are routed to the ping buffer 
and the secondary voice packet contained in the pong buffer begins 
its transmission. The first packet handler and the collision detection 
circuits are disabled at this time, because the first packet has been 
successfully transmitted and there is no need to check for collisions 
(see the description of MSTDM protocol in Section III). When the 
second packet is transmitted, NPS receives a transmit done signal and 
again schedules a request for T seconds later, and the procedure cycles 
until the user decides to end the voice connection. 

If, during the transmission of the first packet, a collision is detected 
by CDC, then the transmit flip-flop is cleared, the buffer pointer is 
reset to the beginning, and TCC tries to make a request again. At 
present, the only retry strategy built into the protocol processor is on 
the basis of a random retry. The operation of the collision detection 
circuit is described in a separate section following the discussion of 
the data transmitter. 


4.2.1.2 Data. Figure 6 is a block diagram of the data transmitter. This 
circuit is considerably less complex than the voice transmitter. It does 
not require the ping-pong buffering mode and there is no difference 
between the way the first packet and the following ones are handled; 
this eliminates the need for the first packet handler circuit. 

Before establishing a data connection, the TCC writes the header 
information into the buffer, and then it sets the transmitter ready 
flip-flop. Now the transmitter buffer is ready to accept data bits. The 
data are written in the buffer and once the packet is formed and the
buffer is full, the End of Packet Detection (EPD) circuit sends a signal
to TCC, which resets the transmitter ready flip-flop. The transmitter
is then ready to access the channel and data sources should not try to
send any new data for packetization. At the same time, TCC sends a
transmission request signal to the XRC. When the channel becomes
idle, XRC sets a transmit flip-flop and the transmission of the packet
begins. Once the last bit of the packet is transmitted, XRC generates
a transmit done signal, which sets the transmitter ready flip-flop; the
buffer then becomes available for packetizing data bits, and the cycle
starts again.

Fig. 6. Block diagram of the data transmitter.

In case of a collision with another packet, the same procedure used
for the first packet of voice is followed. The collision detection circuit
is shared between the voice and data transmitters.


4.2.1.3 Collision detection. In homenet, collisions can be detected in 
the digital domain by comparing every single bit of a packet before 
and after its transmission. Relying on just the amplitude of the signal 
on the line at each transmitter for detecting interference from other 
transmitters can result in missing collisions that are caused by
transmitters that are distant from each other. In homenet, however, all the
transmitters in each net, say net H_i, send their signal, on frequency
F_0, to the nominal head end. There it is translated to frequency band
F_i and returned to all sites. By inserting proper attenuators on the
transmitters, we arrange to have the same amplitude for all the signals
received at the nominal head end from different transmitters. As a
result, the signals that are returned in frequency F_i at each site have
the same relative amplitude, and there is no chance of missing
collisions between distant transmitters.

The collision detection circuit has a dedicated demodulator that is
always listening to the frequency of the net that the transmitter is in
(i.e., F_i for net H_i). This demodulator is referred to as the collision
demodulator in the block diagram of Fig. 3. Notice that detecting
collision in frequency F_0 rather than F_i does not have any advantage
over baseband, and would not solve the amplitude problem.

The process of modulating the bits to frequency F_0, transmitting
them to the nominal head end, translating them to frequency band
F_i, and transmitting them back introduces some location-dependent
delay between the bits that leave the transmitter and the
ones that are received by the collision demodulator. The collision
detection circuit corrects for this delay by inserting an equal delay on
the transmitted bits before they are compared. This is accomplished
by an adjustable delay line on one input of the CDC, which is adjusted
only once depending on the user’s distance from the nominal head
end.
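
The bit-by-bit comparison with a compensating delay can be illustrated as follows. This is a minimal sketch, not the CDC circuit itself: the function name `detect_collision` is hypothetical, the round-trip delay is assumed to be a whole number of bit times, and `received_bits[k + delay]` is assumed to be the returned copy of `sent_bits[k]`.

```python
from collections import deque


def detect_collision(sent_bits, received_bits, delay):
    """Return the index of the first sent bit that came back altered,
    or None if every returned bit matches.

    delay -- round-trip delay in bit times (set once per site)
    """
    line = deque([None] * delay)      # models the adjustable delay line
    for k, rx in enumerate(received_bits):
        # Feed the next transmitted bit (if any) into the delay line ...
        line.append(sent_bits[k] if k < len(sent_bits) else None)
        # ... and compare the bit leaving the line with the received bit.
        delayed = line.popleft()
        if delayed is not None and delayed != rx:
            return k - delay
    return None
```

For example, with a delay of two bit times, a packet `[1, 0, 1, 1]` that returns as `[0, 0, 1, 0, 0, 1]` would be flagged at sent-bit index 2.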


4.2.2 Receiver 


Much like the transmitter, the receiver is also divided into two 
almost independent sections, voice and data. In the transmitter, the 
collision detection circuit was shared between the voice and data 
sections. In the receiver, there is no need for collision detection; 
however, there is one circuit that is shared between the two sections, 
and that is the Sync and Address Detection (SAD) circuit. 


4.2.2.1 Voice. Figure 7 shows the block diagram of the voice receiver. 
The operation of the ping and pong buffers in the receiver is similar 
to the voice transmitter. When one is receiving the input bit stream 
at the channel rate, the other is playing the previously received voice 
packet into a codec at the voice bit rate (64 kb/s). The switching of 
the ping-pong mode is done by the TCC. The input bit stream is first
demodulated from band F_i to baseband and directly fed to the input
of the buffers, the SAD, and the end of packet detection circuit. Right
after the transition of the channel busy signal from an idle to a busy
state, SAD starts looking for a sync word. The sync words for data
and voice differ in their most significant bit. If the detected sync is a
voice sync, then the receiver starts looking for either another sync
word or its address. The sync words can be repeated consecutively in
the header area of the packet as many times as desired. The destination
address should always be immediately after a sync word.

Fig. 7. Block diagram of the voice receiver.
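
The header search performed by the SAD circuit can be pictured with the sketch below. Everything specific in it is an assumption made for illustration: the fields are taken to be one byte wide, the sync values `DATA_SYNC` and `VOICE_SYNC` are invented (chosen only so that they differ in the most significant bit), and addresses are assumed never to equal a sync value.

```python
DATA_SYNC = 0x35    # hypothetical sync pattern for data packets
VOICE_SYNC = 0xB5   # same pattern with the most significant bit set


def detect_header(header, my_address):
    """Return "voice" or "data" if the header is addressed to us, else None.

    header     -- bytes starting at the first sync word
    my_address -- this receiver's one-byte address
    """
    i = 0
    kind = None
    # Skip any number of consecutive sync words, remembering the type.
    while i < len(header) and header[i] in (DATA_SYNC, VOICE_SYNC):
        kind = "voice" if header[i] & 0x80 else "data"
        i += 1
    if kind is None or i >= len(header):
        return None                  # no sync word, or nothing after it
    # The destination address immediately follows the last sync word.
    return kind if header[i] == my_address else None
```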

If the sync word indicates a voice packet and the destination address 
field of the packet is matched with the receiver’s address, then SAD 
first stores the following field (i.e., the source field) as the address of 
the source, and sends a signal to TCC indicating the beginning of the 
voice bits. TCC then sets a receive flip-flop and applies the transmis- 
sion clock to the ping buffer. The voice bits begin to be stored in this 
buffer until an end of packet is detected by SAD, at which time TCC 
resets the receive flip-flop and switches the ping-pong mode. The end 
of packet detection is not done based on just the packet length. The 
voice transmitter always transmits a fixed number of bits for each 
packet, as required by MSTDM protocol. However, not all of these 
bits are actual voice bits; the bits in the overflow area may not be 
useful information. Therefore the end of the voice samples in the 
packet must be marked. This is done at the transmitter by placing a 
flag byte at the end of the voice samples with all eight bits set to “1.”
When the transmitter’s ping-pong mode is switched, the value of the 
counter for the buffer that was receiving the voice bits is saved; when 
the transmission of this buffer begins, the control circuit monitors the 
buffer counter and when it is equal to the saved value, TCC forces the 
output bits to a high state until the buffer counters reach the packet 
length. Therefore the unused portion of the voice packet is transmitted 
as “all ones.” At the receiver, the SAD circuit searches for an “all one”
flag byte and sends an end of packet signal to TCC, as mentioned 
above. The codec used for digitizing the voice signal does not use the 
“all one” level in its code. Therefore, no voice samples will be coded 
as all ones to cause a false end of packet detection at the receiver. 
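
The marking scheme can be summarized with a short byte-level sketch. It assumes, for simplicity, that voice samples are byte aligned within the packet; the function names are hypothetical and the header fields are omitted.

```python
FLAG = 0xFF   # the "all ones" byte, which the codec never produces


def pad_packet(samples, packet_len):
    """Transmitter side: force the unused portion of the packet to ones."""
    if FLAG in samples:
        raise ValueError("voice samples must not contain the all-ones byte")
    if len(samples) > packet_len:
        raise ValueError("too many samples for one packet")
    return bytes(samples) + bytes([FLAG]) * (packet_len - len(samples))


def extract_samples(packet):
    """Receiver side: the voice samples end at the first all-ones byte."""
    end = packet.find(bytes([FLAG]))
    return packet if end < 0 else packet[:end]
```

Because the first padding byte already reads as the flag byte, the receiver needs no separate length field to find the end of the voice samples.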

Once the end of packet is detected and the ping-pong mode is 
switched by TCC, the voice samples stored in, say, the ping buffer are 
played back to the codec to reconstruct the voice signal. The pong 
buffer is idle at this time and is waiting to receive the next voice 
packet, which will arrive sometime during the playback process. 

In MSTDM protocol, the voice packets can experience a bounded delay
less than one packet transmission time. This delay may cause the 
receiver buffer to run out of voice samples before the next packet 
arrives, and may also cause a distortion in the speech signal. To 
alleviate this problem, a small delay is inserted before the beginning 
of the playback process for the first packet of voice (only the first 
packet). This task is accomplished by the First Packet Delay (FPD) 
circuit at the receiver (see Fig. 7). After the first packet is played back, 
TCC deactivates this circuit. The FPD circuit is also used for recover- 
ing from a distortion problem created by timing discrepancies between 
user nodes. This will be described in detail in a later section on clock 
synchronization. 

4.2.2.2 Data. Figure 8 shows a block diagram of the data receiver. 
This circuit is the simplest section of the protocol processor. The sync 
and address detection circuit is shared between the voice and data 
sections. When a data sync pattern followed by the receiver’s address 
is detected in the input bit stream, TCC first resets a data ready flip- 
flop, indicating that the receiver buffer is being filled with the incoming 
packet bits. The receive clock is then applied to the buffer counter 
and data bits are stored in the buffer until an end of packet signal is 
generated by the EPD circuit. The end of packet is simply detected by 
comparing the value of the buffer counter with the length of the data 
packet. This signal sets the data ready flip-flop, indicating that the 
receiver buffer contains valid data bits and is ready to transfer those 
to the host system. At this time, if the host ready signal is high, TCC 
applies the data clock to the buffer counter and the data bits are 
transferred to the host system. Once this transfer is made, the circuit
is reset and TCC waits for the next data sync detect signal, at which
time the cycle starts again.

Fig. 8. Block diagram of the data receiver.

The performance of both the data receiver and transmitter can be 
somewhat improved using a ping-pong buffering scheme as in the voice 
section. However, since this is not a necessity, we did not choose to 
implement it in our experimental system. This option can always be 
easily incorporated into the system if needed. 


4.2.3 Clock synchronization 


When a communication link, either data or voice, is established 
between two sites, it is obvious that for proper operation the receiver’s 
clock pulses should be synchronized with the incoming data (i.e., 
synchronized with the transmitter’s clock). In homenet, each user 
station (user-node hardware) has its own crystal oscillator generating 
a 16-MHz master clock signal. All the clock pulses used by the protocol 
processor are derived from this master clock, and are synchronized 
with the incoming data using the transitions of the input bit stream. 
The synchronization circuit, which is part of the demodulator board, 
uses a very simple digital technique similar to the clock recovery 
circuits used with nonreturn to zero data streams. 

For the purpose of synchronization, the receiver requires at least 
two or three transitions in the incoming bit stream before any useful 
data can be picked up from the line. For the voice packets the required 
transitions can be placed in the preempt portion of the packet; for the
data packets a 4-bit preempt header is added to the beginning of the 
packet format shown in Fig. 2b. 

Due to different operating conditions the frequency of the crystal 
oscillators at two ends of a communication link can be slightly differ- 
ent, thus creating a minor timing discrepancy. This small frequency 
difference does not create any difficulty in receiving the information 
bits because the bit values are read into the receiver buffer in the 
middle of clock pulses, and a small drift can be tolerated. However, 
due to the periodic nature of the voice sources and their real-time 
requirement, the timing discrepancy affects the voice section of the 
protocol processor in two ways, as described below. 

First, we consider the effect of timing discrepancy on the next 
packet scheduling time. The situation is illustrated in the timing 
diagram shown in Fig. 9 for two arbitrary voice sources that have
successfully transmitted their first packet and reserved a movable time
slot on the channel. Voice source No. 1 schedules its next transmission
for T seconds after it transmits the current packet (i.e., T seconds
after the falling edge of the TRANSMIT signal in Fig. 9). After T seconds,
the NPS circuit generates a TRANSMIT REQUEST pulse and the
voice source is guaranteed to have access to the channel within δ
seconds, where δ is between zero and a maximum of one packet
transmission time. Voice source No. 2 operates in exactly the same
way except that, owing to the slight timing discrepancies between the
two sources, the next packet scheduling time for this source will be
T + ε rather than T. This can cause the TRANSMIT REQUEST
pulse and the time slot for voice source No. 2 to drift very slowly in
time with respect to voice source No. 1 (see Fig. 9). The drifting
continues until the time slots for both sources are adjacent to each
other. If there are a number of voice sources on the line, all the time
slots gradually move until they are adjacent to each other. Notice that
this gradual movement of the slots is different from the movement
dictated by MSTDM protocol, which can vary depending on the data
traffic. If the data traffic is light and the voice sources continue to
stay on the channel for a long time, then, from the above discussion,
all the time slots will eventually be packed next to each other. The
timing discrepancy in this case does not introduce any difficulty; as a
matter of fact, it creates a favorable situation.

Fig. 9. Effect of timing discrepancy on next packet scheduling.

With respect to the rate at which the voice samples are generated 
at the source and used at the destination, the timing discrepancy can 
create an undesirable situation. If the receiver clock is slower than the 
transmitter, then there will be a time when both ping and pong buffers 
at the receiver will be full when a new packet arrives. If the receiver 
clock is faster than the transmitter, then the receiver will eventually
run out of samples before a new packet arrives. These situations
cause a distortion in the voice signal, which cannot be recovered from 
for some time. In the receiver hardware, both of the above situations 
cause the ping-pong switching command to be generated at the same 
time with transmission. To solve the problem, when such a condition 
is detected, TCC activates the first packet delay circuit (see Fig. 7), 
which inserts a delay in the playback of only the next packet and 
causes the ping-pong switching command not to overlap with the 
transmission. The consequence is that the current voice packet is lost, 
but the receiver goes back into a normal undistorted operation mode. 
In our system, which uses a conventional oscillator and crystal type, 
this situation occurs about every 20 minutes. In other words, if a 
conversation lasts for a long time, then every 20 minutes one packet 
of voice (i.e., a few milliseconds of voice signal) will be lost. There are 
more expensive solutions that totally eliminate the problem; however, 
losing a few milliseconds out of 20 minutes (12 × 10^5 ms) is of no
significance at all, and more expensive solutions are not justified. 
Besides, using better crystal types can increase the 20-minute period 
to well over an hour. 
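
The size of this loss is easy to check numerically. The helper below is only a worked calculation; the 3-ms packet duration used in the example is an assumed value standing in for "a few milliseconds," not a figure taken from the system.

```python
def lost_fraction(packet_ms, interval_min=20):
    """Fraction of the voice signal lost when one packet of duration
    packet_ms is dropped every interval_min minutes."""
    interval_ms = interval_min * 60 * 1000   # 20 min = 1.2e6 ms
    return packet_ms / interval_ms


# Assumed 3-ms packet: 3 / 1.2e6 = 2.5e-6 of the conversation.
fraction_20_min = lost_fraction(3)
# With the slip interval stretched to an hour, the fraction is three times smaller.
fraction_60_min = lost_fraction(3, interval_min=60)
```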


4.2.4 Packet confusion 


In the case of collision between a secondary voice packet and a data 
packet, a confusion between packets can occur at the receiver. Consider 
the following situation: a data source is transmitting when it detects 
a collision with a voice packet after the sync word has been transmitted 
(i.e., in the address area of the packet). The data transmitter stops, 
the voice transmitter continues with its transmission, and the line will 
look something like the following: 
| DATA SYNC | GARBAGE | VOICE SYNC | VOICE DESTINATION | VOICE SOURCE |


In the above display the collision occurred in the garbage area, which 
includes the voice preempt header as well. 

Let us consider what happens at the receivers. All the receivers
detect the data sync and start looking for the destination address. One 
particular receiver whose address is equal to the first byte of the 
garbage area is going to detect a match, and it will erroneously start 
storing the following bits as a data packet. Obviously this situation is 
very undesirable. We can solve the problem by looking at the length 
of the received packet. A false data packet like the above will certainly 
be longer than a normal data packet. If, at the time that the end of 
packet pulse is detected (based on the packet length as described 
before) the receiver is still busy with the incoming bit stream, then 
the received data packet must be a false one and should therefore be 
discarded. 
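
The length test can be expressed as a small check. This is a sketch under an assumed model: `burst` stands for all the bits that follow the matched address while the channel stays busy, and the function name is hypothetical.

```python
def receive_data_packet(burst, packet_len):
    """Store packet_len bits, then discard the packet if the channel is
    still busy when the end-of-packet pulse occurs.

    burst      -- bits arriving after the address match while the
                  channel remains busy
    packet_len -- length of a normal data packet, in bits
    """
    stored = burst[:packet_len]
    still_busy = len(burst) > packet_len   # bit stream continues past the end
    if still_busy:
        return None        # false data packet: longer than a real one
    return stored
```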

The receiver that is supposed to receive the voice packet first 
synchronizes on the data sync; however, since it does not find its 
address following the data sync, it resets itself and starts searching 
for a combination of sync and destination address again. It should be 
noted that this problem occurs during the preempt header so that the 
data transmitter stops and the voice sync and address are undisturbed. 
Therefore, we do not have to worry about the voice packet; it will get 
to where it is supposed to. Of course, in the highly unlikely situation 
of the voice destination address being exactly the same as what is 
found in the garbage area, the above assumption will not be true and 
the voice packet will be lost. 


4.3 Head end 


Aside from the frequency translation and filtering operations, a 
variety of user services can be incorporated in the homenet’s main 
head end (head end H, in Fig. 1). Our experimental system presently 
supports two services, an interactive video disc and a service for 
establishing voice links with sources outside the network. Figure 10 
shows the block diagram of the main head end. A user station identical 
to the ones used at the user nodes is dedicated to this head end. 

A head-end processor, which, depending on the application and the 
network requirements, can be anything from a small microprocessor 
to a large computer system, controls all the services. A user can control 
a video disc located at the head end by sending commands to the
head-end processor over the cable using its data transmitter. These
commands are interpreted by the head-end processor and the proper
signals are sent to the video disc through a serial port.

Fig. 10. Block diagram of homenet’s main head end.

To establish a voice link with sources outside the network, a user 
sends a data packet to the head-end processor giving the number to 
be called; the head-end processor then sends the proper commands to 
an autodialing system, which connects the voice path of the protocol 
processor to the outside line. Work on this autodialing option, as well 
as on other services such as call processing, file transfer, and higher- 
level communication protocols, is currently in progress. 


V. CONCLUSION 


This paper described the hardware implementation of a network for 
simultaneous communication of data, digitized voice, and analog video 
in a CATV system. The network uses a broadband approach to solve 
the distance limitation and delay problems suffered in local area 
networks. As a result, a considerably large community of users can be 
supported by the network. The problem of guaranteeing a continuous 
nondistorted speech signal at the receiver is solved using a variation 
of CSMA/CD protocol called movable slot time division multiplexing. 
This protocol places an upper bound on the maximum delay that can 
be experienced by a voice packet. 

At the present time, an experimental network of five user nodes in 
two frequency bands is fully operational. The protocol processing 
hardware was described in detail. This processor is built with standard 
TTL components. It was shown that, owing to the method used by 
the protocol processor for scheduling packet transmission times and
compensating for timing discrepancy, clock synchronization is not a 
major problem. 
Analysis of a Multistage Queue

Multistage queueing mechanisms with quantum service are suitable in
various computer and communication systems to guarantee small delays to 
short jobs without first knowing the service requirement of any job. In this 
paper we analyze the efficacy of one such scheme—a two-stage First-In First- 
Out (FIFO) and Round Robin (RR)—in discriminating between short and 
long jobs. We obtain the distribution of the delay for short jobs, the cycle time 
in the RR queue for long jobs, and the number of messages in the FIFO and 
the RR queues. For the specific parameters used in our numerical results, the 
two-queue scheme seems to discriminate effectively between the long and 
short jobs. 


I. INTRODUCTION 


In computer systems as well as data communication systems, it is 
frequently desirable to guarantee that short jobs see small delay even 
under a high load. This may be done at the expense of long jobs. It is 
also true that in many of these systems the time required to do a job 
is not known beforehand. Thus simple priority schemes based on the 
service requirements of jobs are not possible. If the jobs are served in 
order of arrival, First-In First-Out (FIFO), then all jobs will see long 
delays at high load. To discriminate between short and long jobs 
without knowing the type of a job beforehand, various schemes based 
on quantum service are used. The simplest of these is a Round Robin 
(RR) scheme. Here, when a job arrives, it is put behind all the waiting 


* Authors are employees of AT&T Bell Laboratories. 





Copyright © 1985 AT&T. Photo reproduction for noncommercial use is permitted with- 
out payment of royalty provided that each reproduction is done without alteration and 
that the Journal reference and copyright notice are included on the first page. The title 
and abstract, but no other portions, of this paper may be copied or distributed royalty 
free by computer-based and other information-service systems without further permis- 
sion. Permission to reproduce or republish any other portion of this paper must be 
obtained from the Editor. 




jobs. When it reaches the server it gets at most $\Delta$ time units of service.
If its service requirement is smaller than $\Delta$, then the job leaves before
the quantum $\Delta$ expires. Otherwise, after getting $\Delta$ units of service, it
is put at the back of the queue and waits for the next pass at the
server. Since a shorter job requires fewer passes, its delay is smaller.
For this scheme, Wolff¹ obtained the mean delay conditioned on the
service requirement as the solution of an infinite system of linear
equations. Other schemes are possible if more discrimination is desired
between short and long jobs. At one extreme we have a scheme based
on an infinite number of queues (IQ). In this scheme the server keeps an
infinite number of queues numbered 1, 2, .... On arrival a job is
placed at the back of the queue numbered “1.” When the server
completes a service, it takes the first job from the lowest numbered
nonempty queue. If this job is from queue n, then it gets at most $\Delta_n$
time units of service. If its service is not complete by then, it is put at
the back of the queue numbered n + 1. Schrage analyzed this scheme
and derived the mean and Laplace-Stieltjes transform of the delay
conditioned on the service requirement.² Somewhere between RR and
IQ schemes are the schemes based on a finite number (N + 1) of
queues. The first N queues behave as they do in the IQ scheme, while
the last one can be either FIFO or RR. In this paper we study one
such scheme. In particular, we consider the case where N = 1 and the
second queue is served round robin. We call this queueing system a
FIFO-RR system. The analysis for general N is almost identical but
the resulting expressions and notation are more complex.
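
To make the FIFO-RR discipline concrete, the following sketch simulates it for a given list of jobs. It is an illustration, not the analytical model of this paper: the function name and arguments are hypothetical, `q1` and `q2` stand for the quanta in the FIFO and RR queues, and a quantum in progress is assumed not to be interrupted by new arrivals.

```python
from collections import deque


def fifo_rr(jobs, q1, q2):
    """Simulate a two-stage FIFO-RR server.

    jobs -- list of (arrival_time, service_time), sorted by arrival time
    q1   -- quantum given to a job on its single pass through the FIFO queue
    q2   -- quantum given on each pass through the RR queue
    Returns the completion time of each job, in the order of `jobs`.
    """
    fifo, rr = deque(), deque()
    remaining = [service for _, service in jobs]
    done = [None] * len(jobs)
    t, nxt = 0, 0
    while nxt < len(jobs) or fifo or rr:
        # Admit every job that has arrived by time t to the FIFO queue.
        while nxt < len(jobs) and jobs[nxt][0] <= t:
            fifo.append(nxt)
            nxt += 1
        if fifo:                       # lowest numbered nonempty queue first
            j, quantum = fifo.popleft(), q1
        elif rr:
            j, quantum = rr.popleft(), q2
        else:
            t = jobs[nxt][0]           # server idle until the next arrival
            continue
        run = min(quantum, remaining[j])
        t += run
        remaining[j] -= run
        if remaining[j] > 0:
            rr.append(j)               # unfinished: back of the RR queue
        else:
            done[j] = t
    return done
```

For two jobs arriving together with service times 5 and 1 and both quanta equal to 2, the short job would finish at time 3 instead of time 6 under plain FIFO, while the long job still finishes at time 6.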

Fraser and Morgan³ have analyzed this FIFO-RR discipline as a
model of the trunk service discipline in Datakit™ Virtual Circuit Switch
(VCS) (see Ref. 3 for details of the trunk module operation in Datakit
VCS). They obtain the mean delay for various classes of jobs under
fairly general assumptions, essentially by extending the results in
Wolff¹ to the FIFO-RR system. They also use simulation to obtain
the percentiles of the delay distributions. In this paper we focus on 
analytical methods to obtain information about the delay distributions. 
In particular, we derive a simple expression for the transform of the 
delay distribution for short jobs under the assumption of Poisson 
arrivals and general service time distribution. This transform is in- 
verted numerically to obtain the delay distribution. This enables us to 
get the delay distribution for one-character typed messages and short 
control messages in communication applications. For jobs long enough 
to require service in both FIFO and RR queues, the analysis is more 
difficult. Under more restrictive assumptions, we get the marginal 
generating function of the number of jobs in the FIFO and RR queues, 
the transform of the cycle time in the RR queue, and the mean sojourn 
time in essentially closed form. We illustrate our analysis with
numerical results from a data communication application such as the
trunk service in Datakit VCS. In particular, we show that extremely 
short jobs see very short delay even under very high overall load. We 
also discuss how our model may differ from the actual service discipline 
in data communication applications and the performance implications 
of these differences. 

The analysis presented here for the RR queue uses busy cycle 
analysis to derive quantities of interest. Recently, Ramaswami showed 
that some of these quantities can also be derived using matrix 
methods.* 

This paper is organized as follows: In Section II we define the model 
formally and introduce the notation. The delay in the FIFO queue is 
analyzed in Section III. In Section IV we derive the performance 
measures for the RR queue. Finally, in Section V we illustrate our 
results with an application from communication over a 56-kb/s link. 


II. MODEL


In this section we formally define the model of the FIFO-RR queues, 
which we will analyze in Sections III and IV. The analysis of Section 
III is for the FIFO queue and thus will give the delay distribution for 
the jobs with service time less than or equal to the quantum size in 
the FIFO queue. This will be done under fairly weak assumptions. In 
Section IV we will analyze the RR queue under more restrictive 
assumptions. 

Assume that the arrival process of the jobs is Poisson at rate $\lambda$. Let
$H$ be the distribution function of the service time. Let $\Delta_1$ and $\Delta_2$
denote, respectively, the quanta of service in the FIFO and the RR
queue.

In Section III we will let $H$ be general. In Section IV we will assume
that there are two types of jobs. A fraction $p$ of the jobs are short
enough to be completed within one quantum in the FIFO queue. Thus,
if $H_1$ is the distribution function of the service time of the short jobs,
then

$$H_1(\Delta_1) = 1. \tag{1}$$

The other fraction, $(1 - p)$, of jobs may be long and has distribution
function

$$H_2(x) = 1 - e^{-\mu x}, \qquad 0 \le x < \infty, \tag{2}$$

for some $\mu > 0$. Thus

$$H(x) = pH_1(x) + (1 - p)(1 - e^{-\mu x}), \qquad 0 \le x < \infty. \tag{3}$$

Let $h_{i1}$ and $h_{i2}$ denote the first two moments of $H_i$, $i = 1, 2$. Of course,
$h_{21} = 1/\mu$ and $h_{22} = 2/\mu^2$.

When a job arrives it is put at the back of the FIFO queue. When
its turn arrives it gets up to $\Delta_1$ units of service. If the complete service
is not rendered by then, the job moves to the RR queue. The RR queue
is served in a round robin way with quantum size $\Delta_2$. The FIFO queue
has priority over the RR queue to the extent that after each quantum
of service, the next service is from the FIFO queue as long as there is
work in the FIFO queue.
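
The service discipline just described can be illustrated with a minimal sketch that splits a single job's service requirement into the chunks it would receive. The function names `split_into_quanta` and `rr_passes` are hypothetical, introduced only for this illustration; the sketch ignores queueing and other jobs entirely.

```python
def split_into_quanta(x, delta1, delta2):
    """Return the service chunks received by a job with total requirement x.

    The first chunk is served in the FIFO queue (at most delta1 units);
    any remainder is served in the RR queue in chunks of at most delta2.
    """
    if x <= 0 or delta1 <= 0 or delta2 <= 0:
        raise ValueError("x, delta1 and delta2 must be positive")
    chunks = [min(x, delta1)]          # FIFO quantum
    remaining = x - chunks[0]
    while remaining > 0:               # RR quanta, one per pass
        chunk = min(remaining, delta2)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


def rr_passes(x, delta1, delta2):
    """Number of passes through the RR queue (0 if the job ends in FIFO)."""
    return len(split_into_quanta(x, delta1, delta2)) - 1
```

With quanta of 16 and 48 characters, a 100-character job is served as one FIFO chunk followed by two RR chunks, while a job of 16 characters or fewer never reaches the RR queue.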


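III. ANALYSIS OF THE FIFO QUEUE

Let

$$q_1 = H(\Delta_1), \tag{4}$$

and for $i \ge 1$,

$$r_i = \frac{H(\Delta_1 + i\Delta_2) - H[\Delta_1 + (i - 1)\Delta_2]}{1 - q_1}. \tag{5}$$

Let

$$N_2 = \sum_{i=1}^{\infty} i r_i. \tag{6}$$

Thus $N_2$ is the expected number of passes at the server in the RR
queue given that a job enters the RR queue. Let

$$Q_1(t) = H(t), \qquad 0 \le t < \Delta_1, \tag{7}$$

and

$$Q_2(t) = \sum_{i=1}^{\infty} \frac{H[\Delta_1 + (i - 1)\Delta_2 + t] - H[\Delta_1 + (i - 1)\Delta_2]}{1 - q_1}, \qquad 0 \le t < \Delta_2. \tag{8}$$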
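
As a numerical illustration of eqs. (4) through (6), the following sketch evaluates $q_1$, the $r_i$, and $N_2$ for the two-type service distribution of eq. (3). It assumes that the short-job distribution satisfies $H_1(x) = 1$ for $x \ge \Delta_1$, as in eq. (1); the function name `tail_quantities` and the truncation parameter `terms` are assumptions of this sketch, not part of the paper.

```python
import math


def tail_quantities(p, mu, delta1, delta2, terms=200):
    """Return (q1, [r_1..r_terms], N2) for H(x) = p*H1(x) + (1-p)*(1-exp(-mu*x)).

    Only values of H at x >= delta1 are needed, where H1(x) = 1 by eq. (1).
    The infinite sum for N2 is truncated after `terms` terms (an assumption
    that is harmless when exp(-mu*delta2*terms) is negligible).
    """
    def H(x):
        return p + (1.0 - p) * (1.0 - math.exp(-mu * x))

    q1 = H(delta1)                                        # eq. (4)
    r = [(H(delta1 + i * delta2) - H(delta1 + (i - 1) * delta2)) / (1.0 - q1)
         for i in range(1, terms + 1)]                    # eq. (5)
    n2 = sum(i * ri for i, ri in enumerate(r, start=1))   # eq. (6)
    return q1, r, n2
```

For an exponential tail the $r_i$ are geometric, so the computed $N_2$ should agree with $1/(1 - e^{-\mu\Delta_2})$.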


Then the rate of service completions in the FIFO queue is $\lambda_1 = \lambda$, and
the distribution of the service time, $X_1$, in the FIFO queue is given by

$$P\{X_1 \le t\} = F_1(t) = Q_1(t) = H(t), \qquad 0 \le t < \Delta_1, \tag{9}$$

and

$$P\{X_1 = \Delta_1\} = F_1(\Delta_1) - F_1(\Delta_1^-) = 1 - H(\Delta_1^-). \tag{10}$$

The rate of service completions in the RR queue is

$$\lambda_2 = \lambda N_2(1 - q_1), \tag{11}$$

and the distribution of the amount of service, $X_2$, in a typical service
in the RR queue is

$$P\{X_2 \le t\} = F_2(t) = Q_2(t)/N_2, \qquad 0 \le t < \Delta_2, \tag{12}$$

$$P\{X_2 = \Delta_2\} = F_2(\Delta_2) - F_2(\Delta_2^-) = 1 - \frac{Q_2(\Delta_2^-)}{N_2}. \tag{13}$$

Now consider a nonpreemptive priority queueing system with two
FIFO queues and one server. The arrival rate and the service time
distribution in queue $i$ are $\lambda_i$ and $F_i$, respectively, $i = 1, 2$. It can be
shown using level-crossing arguments (see Refs. 5 and 6) that the
distribution of the waiting time in the high-priority queue does not
depend on the actual dynamics of arrivals in the low-priority queue.
Thus, the waiting time distribution for an arbitrary arrival in queue 1
for this system is the same as that for an arbitrary arrival in the FIFO
queue in the original FIFO-RR system. Thus, let $f_1$ and $f_2$ be the
Laplace-Stieltjes transforms of $F_1$ and $F_2$, respectively, and let $W_1$ be
the Laplace-Stieltjes transform of the waiting time in the FIFO queue.
Let

$$\rho_1 = \lambda_1 \int_0^{\infty} t\,dF_1(t), \tag{14}$$

$$\rho_2 = \lambda_2 \int_0^{\infty} t\,dF_2(t). \tag{15}$$


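Then, from Ref. 7,

$$W_1(s) = \begin{cases}
\dfrac{s(1 - \rho_1 - \rho_2) + \lambda_2[1 - f_2(s)]}{s - \lambda_1 + \lambda_1 f_1(s)} & \text{if } \rho_1 + \rho_2 < 1, \\[2ex]
\dfrac{1 - \rho_1}{\rho_2}\,\dfrac{\lambda_2[1 - f_2(s)]}{s - \lambda_1 + \lambda_1 f_1(s)} & \text{if } \rho_1 + \rho_2 \ge 1,\ \rho_1 < 1.
\end{cases} \tag{16}$$
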


Equation (16) can be inverted using a method of Jagerman® to obtain 
the waiting time distributions numerically. 
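
Before inverting a transform of this kind, it can help to check it against a known moment. The sketch below is not Jagerman's inversion method; it only evaluates the stable-case transform for a service mix with jobs of length 1, $\Delta_1$, and $\Delta_1 + n\Delta_2$ (the first special case of Section V) and compares a finite-difference estimate of the mean wait, $-W_1'(0)$, with the classical mean-wait expression for the high-priority class of a nonpreemptive priority queue, $(\lambda_1 E[X_1^2] + \lambda_2 E[X_2^2]) / (2(1 - \rho_1))$. The function names, the step size `h`, and the parameter values in the usage note are assumptions of this sketch. Time is measured in character transmission times.

```python
import math


def _rates(lam, q11, q1, delta1, delta2, n2):
    """lambda_2, rho_1 and rho_2 for jobs of length 1, delta1, delta1 + n*delta2."""
    lam2 = lam * n2 * (1.0 - q1)
    rho1 = lam * (q11 + delta1 * (1.0 - q11))
    rho2 = lam2 * delta2
    return lam2, rho1, rho2


def w1_lst(s, lam, q11, q1, delta1, delta2, n2):
    """Waiting-time transform W_1(s) in the stable case rho1 + rho2 < 1, s > 0."""
    lam2, rho1, rho2 = _rates(lam, q11, q1, delta1, delta2, n2)
    if rho1 + rho2 >= 1.0:
        raise ValueError("this sketch covers only rho1 + rho2 < 1")
    # 1 - f_1(s) and 1 - f_2(s), written with expm1 to stay accurate for small s
    one_minus_f1 = -(q11 * math.expm1(-s) + (1.0 - q11) * math.expm1(-s * delta1))
    one_minus_f2 = -math.expm1(-s * delta2)
    return (s * (1.0 - rho1 - rho2) + lam2 * one_minus_f2) / (s - lam * one_minus_f1)


def mean_wait_numeric(lam, q11, q1, delta1, delta2, n2, h=1e-6):
    """Forward-difference estimate of -W_1'(0), using W_1(0) = 1."""
    return (1.0 - w1_lst(h, lam, q11, q1, delta1, delta2, n2)) / h


def mean_wait_closed_form(lam, q11, q1, delta1, delta2, n2):
    """(lam1*E[X1^2] + lam2*E[X2^2]) / (2*(1 - rho1)) for the same service mix."""
    lam2, rho1, _ = _rates(lam, q11, q1, delta1, delta2, n2)
    ex1_sq = q11 * 1.0 + (1.0 - q11) * delta1 ** 2
    ex2_sq = float(delta2) ** 2
    return (lam * ex1_sq + lam2 * ex2_sq) / (2.0 * (1.0 - rho1))
```

For example, with `lam=0.05`, `q11=100/111`, `q1=110/111`, `delta1=16`, `delta2=48`, and `n2=99/6`, the two estimates of the mean wait should agree closely.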

Let us now consider the total sojourn time (waiting time + service 
time) for a job in the FIFO queue. Its Laplace-Stieltjes transform is 
given by 

$$D_1(s) = W_1(s) f_1(s) = \frac{s(1 - \rho_1 - \rho_2) + \lambda_2[1 - f_2(s)]}{s - \lambda_1 + \lambda_1 f_1(s)}\, f_1(s). \tag{17}$$

Also, the transform of the total time in the system for a job that has
service requirement $x \le \Delta_1$ is given by

$$D_{1x}(s) = W_1(s) e^{-sx}. \tag{18}$$

The FIFO queue is essentially an M/G/1 queue with arrival rate $\lambda$
and service time distribution $F_1$. Thus, from Ref. 9, the distribution
of the number in the system at an arbitrary instant is the same as
that at an arbitrary arrival epoch and is the same as that seen by a
random departure from the FIFO queue (either exiting the system or
going to the RR queue). Also, the distribution $\{P_{1k}\}$ of the number in
the FIFO queue at a random departure epoch is related to the sojourn
time distribution by

$$P_1(z) = \sum_{k} P_{1k} z^k = D_1[\lambda(1 - z)]. \tag{19}$$

Thus the generating function of the number in the FIFO queue at an
arbitrary instant is given by

$$P_1(z) = D_1[\lambda(1 - z)], \tag{20}$$

where $D_1$ is given by eq. (17).

IV. ANALYSIS OF THE RR QUEUE 


In this section we mainly derive the expressions for various quan- 
tities of interest for the RR queue. However, in that process we also 
obtain some additional quantities related to the FIFO queue. As 
mentioned earlier, in this section we will use the following distribution 
function of the service time: 


$$H(x) = pH_1(x) + (1 - p)(1 - e^{-\mu x}), \qquad 0 \le x < \infty,$$

with

$$H_1(\Delta_1) = 1.$$


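As in Section III, let $X_1$ and $X_2$ denote the service times in typical
chunks of service in the FIFO and the RR queues, respectively. Let $F_1$
and $F_2$ be the distribution functions of $X_1$ and $X_2$, respectively. Then,
from Section III, and with $h_1$ denoting the Laplace-Stieltjes transform
of $H_1$, we have

$$F_1(x) = \begin{cases} pH_1(x) + (1 - p)(1 - e^{-\mu x}), & 0 \le x < \Delta_1, \\ 1, & x \ge \Delta_1, \end{cases}$$

$$F_2(x) = \begin{cases} 1 - e^{-\mu x}, & 0 \le x < \Delta_2, \\ 1, & x \ge \Delta_2, \end{cases}$$

$$f_1(s) = ph_1(s) + \frac{1 - p}{\mu + s}\left\{\mu + s e^{-(s + \mu)\Delta_1}\right\},$$

$$f_2(s) = \frac{1}{\mu + s}\left\{\mu + s e^{-(s + \mu)\Delta_2}\right\},$$

$$\rho_1 = \lambda\left[ph_{11} + \frac{(1 - p)(1 - e^{-\mu\Delta_1})}{\mu}\right],$$

and

$$\rho_2 = \frac{\lambda(1 - p)e^{-\mu\Delta_1}}{\mu}.$$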
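
The occupancies of the two queues can be evaluated directly from these expressions. The sketch below assumes the forms $\rho_1 = \lambda[p h_{11} + (1-p)(1 - e^{-\mu\Delta_1})/\mu]$ and $\rho_2 = \lambda(1-p)e^{-\mu\Delta_1}/\mu$, and, as a further assumption, takes short jobs to be single characters by default ($h_{11} = 1$). The function name `occupancies` is hypothetical.

```python
import math


def occupancies(lam, p, mu, delta1, h11=1.0):
    """Return (rho1, rho2) for the two-type service mix of Section IV.

    lam    : total arrival rate
    p      : fraction of short jobs, with mean length h11 (<= delta1)
    mu     : rate of the exponential long-job length
    delta1 : FIFO quantum
    """
    rho1 = lam * (p * h11 + (1.0 - p) * (1.0 - math.exp(-mu * delta1)) / mu)
    rho2 = lam * (1.0 - p) * math.exp(-mu * delta1) / mu
    return rho1, rho2
```

Because the server is work conserving, the two occupancies should add up to the arrival rate times the mean job length, $\lambda[p h_{11} + (1-p)/\mu]$, which gives a quick consistency check.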


We begin by defining and analyzing various busy periods and cycles 
associated with the FIFO-RR system. These will be used subsequently 
to derive quantities of interest. The system is said to be busy as long 
as a job is being processed at high or low priority. The continuous 
interval of time during which the system is busy is called a system- 
busy period. A 1-busy period is started by a job arriving at the system 
while the server is idle and lasts until no job is left in the FIFO queue 
(so that the server moves to the RR queue). A 2-busy period is started 
by a service quantum in the RR queue and lasts until the end of this 
quantum and the time required to empty the FIFO queue. Note that 
each service quantum in the RR queue generates a 2-busy period and 
that a system-busy period consists of exactly one 1-busy period, which 
triggers off the system-busy period and is followed by zero or more 2- 
busy periods. 

Let $B(x, k)$ denote the joint probability that the length of the system-
busy period is less than or equal to $x$ and that during this busy period
exactly $k$ jobs get routed to the RR queue after completing their service
quanta in the FIFO queue. Let

$$\beta(s, z) = \int_0^{\infty} \sum_{k=0}^{\infty} e^{-sx} z^k\, dB(x, k) \tag{21}$$



denote the joint transform of B(x, k). 

Similarly, let $B_1(x, k)$ [$B_2(x, k)$] denote the joint probability that the
length of a 1-busy period (2-busy period) is less than or equal to $x$ and
that during this busy period exactly $k$ jobs are moved to the back of
the RR queue after receiving one service quantum during that cycle.
In the case of a 2-busy period, $k$ includes the job in the RR queue that
started this busy period if it was routed to the back of the RR queue.
Let $\beta_1(s, z)$ and $\beta_2(s, z)$ denote the joint transforms of $B_1(x, k)$ and
$B_2(x, k)$, respectively:

$$\beta_i(s, z) = \int_0^{\infty} e^{-sx} \sum_{k=0}^{\infty} z^k\, dB_i(x, k), \qquad i = 1, 2. \tag{22}$$

In the Appendix we obtain expressions for these quantities in the form 
of functional equations. 

We will now derive the expressions for the cycle time, the distribu- 
tion of the number in the RR queue at an arbitrary instant, and the 
mean sojourn time in the RR queue. The actual distribution function 
of the sojourn time does not seem to lead to a simple form.

First we consider the number in the RR queue at special time points.
We look at the points in time when a service quantum has just
completed and the FIFO queue is empty. The interarrival times of the
new arrivals and the remaining service requirements of the jobs in the
RR queue are independent random variables with exponential distri-
butions. Thus the number in the RR queue at these imbedded instants
forms a Markov chain. Let $n_k$ denote the number in the RR queue at
the $k$th such instant. Then we have the following transition mechanism:

$$n_{k+1} = \begin{cases} n_k - 1 + \eta_2 & \text{if } n_k \ge 1, \\ \eta_1 & \text{if } n_k = 0, \end{cases} \tag{23}$$

where $\eta_2$ denotes the number of jobs sent to the back of the RR queue
during a 2-busy period (including the message in the RR queue that
started this 2-busy period if it gets sent to the back of the RR queue
after completing its service quantum), and $\eta_1$ denotes the number sent
to the RR queue during a 1-busy period.

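Let $\Psi_k(z)$ be the generating function of $n_k$. Then eq. (23) can be
rewritten as

$$\Psi_{k+1}(z) = \frac{[\Psi_k(z) - \Psi_k(0)]\,\beta_2(0, z)}{z} + \Psi_k(0)\,\beta_1(0, z), \tag{24}$$

and the equilibrium generating function $\Psi(z) = \lim_{k \to \infty} \Psi_k(z)$ is given
by

$$\Psi(z) = \Psi(0)\,\frac{z\beta_1(0, z) - \beta_2(0, z)}{z - \beta_2(0, z)}. \tag{25}$$

Equating $\Psi(1)$ with 1, we get the unknown $\Psi(0)$ as

$$\Psi(0) = \frac{1 - b_2}{1 + b_1 - b_2},$$

where

$$b_i = \left.\frac{d\beta_i(0, z)}{dz}\right|_{z=1}, \qquad i = 1, 2. \tag{26}$$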
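
The step from the recursion of eq. (24) to the value of $\Psi(0)$ can be spelled out as follows. This is a sketch of the intermediate algebra, assuming that both busy periods are finite with probability one, so that $\beta_i(0, 1) = 1$ for $i = 1, 2$.

```latex
% Steady state of eq. (24): set \Psi_{k+1} = \Psi_k = \Psi and collect terms.
\Psi(z)\Bigl[1 - \frac{\beta_2(0,z)}{z}\Bigr]
  = \Psi(0)\Bigl[\beta_1(0,z) - \frac{\beta_2(0,z)}{z}\Bigr]
\;\Longrightarrow\;
\Psi(z) = \Psi(0)\,\frac{z\,\beta_1(0,z) - \beta_2(0,z)}{z - \beta_2(0,z)}.

% Numerator and denominator both vanish at z = 1 (assumption: \beta_i(0,1) = 1),
% so differentiate each with respect to z and evaluate at z = 1:
\Psi(1) = \Psi(0)\,\frac{1 + b_1 - b_2}{1 - b_2} = 1
\;\Longrightarrow\;
\Psi(0) = \frac{1 - b_2}{1 + b_1 - b_2}.
```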


We now evaluate the Laplace-Stieltjes transform of the cycle time
defined as the time interval between two successive passes through
the server by a job in the RR queue. Let $t_1$ and $t_2$ be the instants at
which the server begins to provide two successive service quanta to a
tagged job in the RR queue. (In case the tagged job leaves the system
after receiving the first quantum, $t_2$ is the instant at which the job
would have begun to receive the second quantum had it still been in
the system.) Then $t_2 - t_1$ is the cycle time.

Let $m$ denote the number of messages in the RR queue at time $t_1$.
Then the generating function of $m$ is given by

$$E[z^m] = E[z^n \mid n \ge 1], \tag{27}$$

where $n$ is the number in the RR queue at the imbedded instants
discussed above. Thus

$$E[z^m] = \frac{\Psi(z) - \Psi(0)}{1 - \Psi(0)}, \tag{28}$$

where $\Psi$ is as given by eq. (25). Now, because of the memoryless
property of the service requirements of jobs in the RR queue, the cycle
time $t_2 - t_1$ is the sum of $m$ independent and identically distributed 2-
busy periods. Thus

$$\chi(s) = E[e^{-s(t_2 - t_1)}] = E[\beta_2(s, 1)^m] = \frac{\Psi(\beta_2(s, 1)) - \Psi(0)}{1 - \Psi(0)}. \tag{29}$$

We can now obtain an expression for the generating function of the
number in the RR queue at an arbitrary instant.

Let $\tilde{n}$ and $\hat{n}$ denote, respectively, the number of jobs in the RR
queue just after an arbitrary departure from and just before an arbi-
trary arrival to the RR queue. Then $\tilde{n}$ and $\hat{n}$ have the same distribution.
Let $n$ denote, as before, the number in the RR queue at an instant
when a service quantum has just completed and the FIFO queue is
empty. Then

$$P\{\hat{n} = k\} = P\{\tilde{n} = k\}
= \frac{P\{n = k + 1 \text{ and a departure occurs at the end of this service quantum}\}}{P\{n \ge 1 \text{ and a departure occurs at the end of this service quantum}\}}
= \frac{P\{n = k + 1\}(1 - e^{-\mu\Delta_2})}{P\{n \ge 1\}(1 - e^{-\mu\Delta_2})}
= \frac{P\{n = k + 1\}}{P\{n \ge 1\}}. \tag{30}$$



Now, the number of jobs in the RR queue just before a randomly
selected arrival to that queue is the same as the number in the RR
queue when this tagged job began to receive its first (and only) service
quantum in the FIFO queue. This number is a function only of the
arrivals prior to the arrival of the tagged job. Also, this number is
independent of the tagged job’s service time in the FIFO queue and,
in particular, is independent of whether or not the tagged job enters
the RR queue. Therefore, the number in the RR queue when an
arbitrary job completes its service in the FIFO queue has the same
distribution as $\hat{n}$, the number seen just before an arbitrary arrival to
the RR queue. Its generating function is given by

$$\xi(z) = \sum_{k=0}^{\infty} P(\hat{n} = k) z^k
= \sum_{k=0}^{\infty} \frac{P(n = k + 1) z^k}{P(n \ge 1)}
= \frac{\Psi(z) - \Psi(0)}{z(1 - \Psi(0))}. \tag{31}$$

We can now derive the generating function of the number in the
RR queue at an arbitrary instant. If the observation instant lies in an
interval of time during which the server is serving the FIFO queue,
the number of jobs in the RR queue is the same as when the job being
served finishes its service quantum in the FIFO queue, that is, it has
the generating function $\xi(z)$. If the server is working on a job in the
RR queue, then the number in the RR queue has the same distribution
as the variable $m$ defined above, that is, it has the generating function

$$\frac{\Psi(z) - \Psi(0)}{1 - \Psi(0)}.$$

Finally, if at the observation instant the system is empty, the gener-
ating function of the number in the RR queue is 1. Thus the generating
function of the number in the RR queue at an arbitrary instant is
given by

$$P_2(z) = \rho_1 \xi(z) + \rho_2\,\frac{\Psi(z) - \Psi(0)}{1 - \Psi(0)} + (1 - \rho_1 - \rho_2). \tag{32}$$

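The average number in the RR queue at an arbitrary instant is
given by

$$L_2 = \left.\frac{dP_2(z)}{dz}\right|_{z=1}. \tag{33}$$

Finally, the mean sojourn time in the RR queue can be obtained by
using Little’s law for that queue. Since jobs enter the RR queue at rate
$\lambda(1 - p)e^{-\mu\Delta_1}$, the mean sojourn time $\bar{D}_2$ is

$$\bar{D}_2 = \frac{1}{\lambda(1 - p)e^{-\mu\Delta_1}} \left.\frac{dP_2(z)}{dz}\right|_{z=1}. \tag{34}$$


V. SPECIAL CASES AND NUMERICAL EXAMPLES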


We now consider two special cases of the general model analyzed in 
Sections III and IV. These examples are typical of some communica- 
tion applications. The service times here will correspond to the number 
of characters in the message. 

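The first case corresponds to three types of jobs: one time unit long,
$\Delta_1$ time units long, and $\Delta_1 + n\Delta_2$ time units long ($n \ge 1$). Let $\lambda$ be the
total arrival rate. Let $X$ denote the service time of a job. Let

$$q_{11} = P\{X = 1\}, \tag{35}$$

$$q_{12} = P\{X = \Delta_1\}, \tag{36}$$

$$q_1 = q_{11} + q_{12}, \tag{37}$$

$$r_n = \frac{P\{X = \Delta_1 + n\Delta_2\}}{1 - q_1}. \tag{38}$$

Then

$$N_2 = \sum_{n=1}^{\infty} n r_n, \tag{39}$$

$$f_1(s) = q_{11}e^{-s} + q_{12}e^{-s\Delta_1} + (1 - q_1)e^{-s\Delta_1} = q_{11}e^{-s} + (1 - q_{11})e^{-s\Delta_1}, \tag{40}$$

$$\rho_1 = \lambda[q_{11} + \Delta_1(1 - q_{11})], \tag{41}$$

$$f_2(s) = e^{-s\Delta_2}, \tag{42}$$

$$\lambda_2 = \lambda N_2(1 - q_1), \tag{43}$$

and

$$\rho_2 = \lambda_2 \Delta_2. \tag{44}$$

Thus,

$$W_1(s) = \frac{s(1 - \rho_1 - \rho_2) + \lambda_2[1 - f_2(s)]}{s - \lambda_1 + \lambda_1 f_1(s)}
= \frac{s[1 - \lambda\{q_{11} + (1 - q_{11})\Delta_1\} - \lambda N_2 \Delta_2(1 - q_1)] + \lambda N_2(1 - q_1)(1 - e^{-s\Delta_2})}{s - \lambda + \lambda[q_{11}e^{-s} + (1 - q_{11})e^{-s\Delta_1}]}. \tag{45}$$
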
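The second example corresponds to the traffic mixture assumed in
eqs. (1) and (2), that is, a proportion $p$ of the jobs have service time
less than or equal to $\Delta_1$ and others have service time exponentially
distributed with mean $1/\mu$. Thus,

$$q_1 = p + (1 - p)(1 - e^{-\mu\Delta_1}), \tag{46}$$

$$1 - q_1 = (1 - p)e^{-\mu\Delta_1}, \tag{47}$$

$$Q_1(t) = pH_1(t) + (1 - p)(1 - e^{-\mu t}), \qquad 0 \le t < \Delta_1, \tag{48}$$

$$F_1(t) = Q_1(t) = pH_1(t) + (1 - p)(1 - e^{-\mu t}), \qquad 0 \le t < \Delta_1, \tag{49}$$

and

$$F_1(\Delta_1) - F_1(\Delta_1^-) = (1 - p)e^{-\mu\Delta_1}. \tag{50}$$

Let $h_1$ be the Laplace-Stieltjes transform of $H_1$. Then, from eqs. (1)
and (2), we get

$$\begin{aligned}
f_1(s) &= ph_1(s) + (1 - p)\int_0^{\Delta_1} \mu e^{-\mu t}e^{-st}\,dt + (1 - p)e^{-\mu\Delta_1}e^{-s\Delta_1} \\
&= ph_1(s) + (1 - p)\left[\frac{\mu}{s + \mu}\left(1 - e^{-(s + \mu)\Delta_1}\right) + e^{-(s + \mu)\Delta_1}\right] \\
&= ph_1(s) + \frac{1 - p}{s + \mu}\left(\mu + s e^{-(s + \mu)\Delta_1}\right).
\end{aligned} \tag{51}$$

Also,

$$r_i = \frac{\left(e^{-\mu(\Delta_1 + (i - 1)\Delta_2)} - e^{-\mu(\Delta_1 + i\Delta_2)}\right)(1 - p)}{1 - q_1}
= \frac{e^{-\mu\Delta_2(i - 1)}e^{-\mu\Delta_1}(1 - e^{-\mu\Delta_2})(1 - p)}{1 - q_1}. \tag{52}$$

Thus

$$N_2 = \frac{e^{-\mu\Delta_1}(1 - p)}{(1 - q_1)(1 - e^{-\mu\Delta_2})} = \frac{1}{1 - e^{-\mu\Delta_2}}, \tag{53}$$

$$\lambda_2 = \frac{\lambda e^{-\mu\Delta_1}(1 - p)}{1 - e^{-\mu\Delta_2}}. \tag{54}$$

Finally,

$$F_2(t) = 1 - e^{-\mu t}, \qquad 0 \le t < \Delta_2, \tag{55}$$

and

$$F_2(\Delta_2) - F_2(\Delta_2^-) = e^{-\mu\Delta_2}. \tag{56}$$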
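Thus,

$$\begin{aligned}
f_2(s) &= \int_0^{\Delta_2} \mu e^{-\mu t}e^{-st}\,dt + e^{-\mu\Delta_2}e^{-s\Delta_2} \\
&= \frac{\mu}{s + \mu}\left(1 - e^{-(s + \mu)\Delta_2}\right) + e^{-(s + \mu)\Delta_2} \\
&= \frac{1}{s + \mu}\left(\mu + s e^{-(s + \mu)\Delta_2}\right).
\end{aligned} \tag{57}$$

For $\rho_1$ and $\rho_2$ we get

$$\rho_1 = \lambda\left(ph_{11} + \frac{(1 - p)(1 - e^{-\mu\Delta_1})}{\mu}\right), \tag{58}$$

and

$$\rho_2 = \frac{\lambda e^{-\mu\Delta_1}(1 - p)}{1 - e^{-\mu\Delta_2}} \cdot \frac{1 - e^{-\mu\Delta_2}}{\mu} = \frac{\lambda(1 - p)e^{-\mu\Delta_1}}{\mu}. \tag{59}$$

Thus, from (16) we get

$$W_1(s) = \frac{s(1 - \rho_1 - \rho_2) + \lambda_2[1 - f_2(s)]}{s - \lambda_1 + \lambda_1 f_1(s)}
= \frac{s\left(1 - \lambda p h_{11} - \dfrac{\lambda(1 - p)}{\mu}\right) + \dfrac{\lambda_2 s}{s + \mu}\left(1 - e^{-(s + \mu)\Delta_2}\right)}{s - \lambda p\,(1 - h_1(s)) - \dfrac{\lambda(1 - p)s}{\mu + s}\left(1 - e^{-(s + \mu)\Delta_1}\right)}. \tag{60}$$
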
In communication applications there is usually an overhead associ-
ated with each segment of transmitted data. Thus, with a typical
segment of $X_1$ characters transmitted from the FIFO queue, $\delta_1$ over-
head characters are added. Similarly, $\delta_2$ characters are added to each
segment of data transmitted from the RR queue. The effective num-
bers of characters sent in a typical service segment from the FIFO and
the RR queue are then $X_1' = X_1 + \delta_1$ and $X_2' = X_2 + \delta_2$, respectively.
The corresponding transforms are then $f_i'(s) = e^{-s\delta_i}f_i(s)$. If we replace
$f_1$ and $f_2$ by $f_1'$ and $f_2'$ and adjust the occupancy numbers accordingly
in all the waiting time transforms, the resulting expressions will give
transforms of the waiting time in the presence of the overhead char-
acters. Besides these overhead characters associated with each service
segment, there are usually overhead characters associated with frames,
that is, data from various service segments are combined into frames
of some maximum size and, at the end of each frame, framing protocol
characters are added. The exact analysis of the waiting time in the
presence of the framing overhead is not easy, but good approximations
can be obtained by distributing the framing overhead over all trans-
mitted characters. We have chosen to exclude the framing overhead
in our numerical calculations.

Next we numerically evaluate performance measures for short and 
long messages for a few traffic mixes. We assume that the communi- 
cation link runs at 56 kb/s. Thus, each character corresponds to 1/7 
ms of delay. We use $\delta_1 = \delta_2 = 2$ in all the cases described below.
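
The conversion between characters and delay is simple arithmetic, sketched below. The figure of 8 bits per character is an assumption of this sketch; it is the value implied by the stated 1/7 ms per character on a 56-kb/s link. The function names are hypothetical.

```python
def char_time_ms(link_rate_bps=56_000, bits_per_char=8):
    """Transmission time of one character, in milliseconds."""
    return 1000.0 * bits_per_char / link_rate_bps


def delay_ms(n_chars, link_rate_bps=56_000, bits_per_char=8):
    """Time to transmit n_chars characters, in milliseconds."""
    return n_chars * char_time_ms(link_rate_bps, bits_per_char)
```

Under these assumptions, a full FIFO quantum of 16 characters plus 2 overhead characters occupies the link for 18/7 ms.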

First consider short messages ($\le \Delta_1$ characters). The Laplace-
Stieltjes transform of the waiting time is given by eq. (16) with
appropriate modifications to account for the overhead characters. We 
numerically inverted this transform using the inversion algorithm of 
Jagerman® for the following traffic mixes and quanta sizes (these 
traffic mixes are selected to give the same mean number of characters 
per message): 

1. $P(X = 1) = 100/111$, $P(X = \Delta_1) = 10/111$, $P(X = \Delta_1 + n\Delta_2) =
r_n \times 1/111$, where $\sum_{n=1}^{\infty} r_n = 1$ and $N_2 = \sum_{n=1}^{\infty} n r_n = 99/6$. Also, $\Delta_1 =
16$, $\Delta_2 = 48$. This will be called the traffic mix $M_1$. Note that $N_2$
uniquely defines the delay distribution irrespective of the individual
values of the $r_n$.

2. $P(X = 1) = 100/111$ and with probability 11/111, $X$ is exponen-
tially distributed with mean 912/11. $\Delta_1 = 16$, $\Delta_2 = 48$. This will be
called the traffic mix $M_2$.

3. For the third traffic mix, $M_3$, we assume that $P\{X = 1\} =
100/111$, $P\{X$ is exponentially distributed with mean 40$\} = 10/111$ and
$P\{X$ is exponentially distributed with mean 512$\} = 1/111$. $\Delta_1 = 16$ and
$\Delta_2 = 48$.

4. The traffic mix here is the same as in $M_3$ but we use $\Delta_1 = \Delta_2 =
16$ and $\Delta_1 = \Delta_2 = 64$. These cases will be denoted by $M_3'$ and $M_3''$,
respectively. With these sizes of the quanta the cases $M_3'$ and $M_3''$
correspond to the cases studied in Refs. 3 and 10 with and without the
framing overhead, respectively. We will use the results in Ref. 10 to
cross check our calculations.

Figures 1 through 3 show the tails of the delay distribution for short
messages under the traffic mixes $M_1$, $M_2$, and $M_3$, respectively. Two
occupancy numbers are given for each curve in these figures. The
lower number is the raw occupancy, while the larger number indicates
the total occupancy including the overhead characters. A number
larger than one indicates the saturation of the RR queue and is an
indication that not all the offered messages will be completely trans-
mitted. Note, however, that the occupancy of the server due to the
FIFO queue is still well below one for these curves.

Fig. 1—Delay distribution for traffic mix M1: short messages.
Fig. 2—Delay distribution for traffic mix M2: short messages.
Fig. 3—Delay distribution for traffic mix M3: short messages.

It is clear from these figures that, for the parameters chosen, the 
short messages will see very short delays due to queueing even when 
the long messages see essentially infinite delays. Of course, if the 
occupancy of the server due to the FIFO is close to one, the short 
messages will also see long delays. 

As mentioned earlier, cases $M_3'$ and $M_3''$ were selected to match the
traffic mix and the quanta sizes of those in Refs. 3 and 10 so that a
cross check can be carried out. In these references Fraser and Morgan
get the delay distribution (in particular, the 95th percentile of the
delay distribution) for short messages via simulation. Since the results
in Ref. 3 are obtained in the presence of the framing overhead, they cannot
be compared directly with our results. However, in their unpublished
work (Ref. 10) Fraser and Morgan obtain the 95th percentile of the delay
distribution without the framing overhead. In Fig. 4 we plot the 95th
percentile of the delay as a function of the raw occupancy of the server
for $M_3'$ and $M_3''$. The simulation points from Ref. 10 are superimposed
on these curves and the agreement looks very good.

We next look at other performance measures studied in Section IV. 

We have chosen two different traffic mixes to illuminate the effects
of various load parameters on the performance measures relevant to
queue 2. The traffic mixes are:

1. A message has a single character in it with probability 0.98, and
with probability 0.02 it is exponentially distributed with a mean length
of five hundred characters. We call this traffic mix $M_4$.

2. The probability of a message being a single character one is 0.9,
and with probability 0.1 it is exponentially distributed with a mean
length of one hundred characters. This traffic mix will be referred to
as $M_5$.

Fig. 4—95th percentile of the distribution for short messages.

In both cases we assume that the “quantum overhead” (i.e., $\delta_1 = \delta_2$)
is two characters. The quantum size $\Delta_1$ is assumed to be 16 characters
and $\Delta_2$ to be 48 characters. The performance measures relevant to
queue 2 include the mean cycle time, the mean sojourn time (for
messages entering queue 2), and the mean queue length. These are
plotted as functions of the overall occupancy (or, equivalently, the
arrival rate) for both traffic mixes in Figs. 5 and 6.

From Figs. 5 and 6 it can be seen that the mean sojourn time for
messages entering queue 2 varies essentially in direct proportion to
the mean message length of type 2 jobs and in inverse proportion to
$(1 - \rho)$, where $\rho$ is the overall utilization of the server. The average
cycle time also appears to vary inversely in proportion to $(1 - \rho)$;
moreover, it is insensitive to the mean length of type 2 messages as
long as it is large compared to $\Delta_1$. The mean queue length also displays
a similar behavior ($\propto \rho/(1 - \rho)$). The assumption of exponentially
distributed message lengths for jobs in queue 2 certainly plays an
important role in this behavior. However, it is heartening that the
mean cycle time should depend upon the server occupancy alone and
be insensitive to the mean message length.

Fig. 5—Some RR performance curves for traffic mix M4.


VI. REMARK 


In data communication systems providing virtual circuit service it
is necessary to move virtual circuits rather than individual messages
from the FIFO to the RR queue and vice-versa. That is, once $\Delta_1$
characters are removed for a virtual circuit in the FIFO queue, it is
moved to the RR queue. On successive turns $\Delta_2$ characters are removed
from this virtual circuit until there are no data to be transmitted on
the virtual circuit. At that time the virtual circuit is moved back to
the FIFO queue so that the first part of the next message is served
from the FIFO queue. This, of course, implies that a short message
immediately following a long message may see considerably longer
delay than predicted by our analysis. This is unavoidable but not likely
to happen often in practice.

Fig. 6—Some RR performance curves for traffic mix M5.

Next, the access lines bringing data to the node that serves the link 
under consideration may be running slower than the 56-kb/s link. In 
that case a long message will be seen as a number of short messages 
by the node using FIFO-RR discipline. This will tend to increase the 
delay for the short messages. The effects of slower access lines are 
studied in Refs. 9 and 11. Also, under a heavy load, flow control may
force a long message to be transmitted as a number of shorter messages,
thus increasing the utilization in the FIFO queue. The effect is
similar to that of the slower access lines. Essentially, this breaking up
of long messages allows the same message to reappear in the FIFO
queue every so often. This could be discouraged by forcing each virtual
circuit to pass through the RR queue for at least one cycle after every
service in the FIFO queue. Only when a virtual circuit in the RR queue
is found empty is it moved back to the FIFO queue. Genuinely short
messages could be exempted from this requirement by making the
decision to move the virtual circuit from the FIFO to the RR queue
depend on whether $\Delta_1$ or fewer than $\Delta_1$ characters were transmitted.

APPENDIX


Analysis of the Busy Periods 


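We now derive expressions for the joint transforms of $B_1(x, k)$ and
$B_2(x, k)$ and the Laplace-Stieltjes transform of the system-busy period.
Recall that

$$\beta_i(s, z) = E[e^{-sb_i}z^K] = \int_{0^-}^{\infty} \sum_{k=0}^{\infty} e^{-sx}z^k\, dB_i(x, k), \qquad i = 1, 2, \tag{61}$$

where $b_i$ denotes the length of an $i$-busy period and $K$ the number of
jobs moving to the back of the RR queue during this busy period.

Let $X$ denote the total service time of a job, $X_1$ the portion that gets
served in the FIFO queue, and $N_1$ the number of new arrivals during
the time $X_1$. Then

$$\beta_1(s, z) = E\bigl[E[e^{-sb_1}z^K \mid X, N_1]\bigr], \tag{62}$$

where

$$E[e^{-sb_1}z^K \mid X, N_1] = \begin{cases} e^{-sX}\beta_1(s, z)^{N_1}, & 0 \le X \le \Delta_1, \\ z e^{-s\Delta_1}\beta_1(s, z)^{N_1}, & X > \Delta_1. \end{cases} \tag{63}$$

Thus

$$\begin{aligned}
\beta_1(s, z) &= p\int_0^{\Delta_1} e^{-sx}e^{-\lambda x[1 - \beta_1(s, z)]}\, dH_1(x) \\
&\quad + (1 - p)\int_0^{\Delta_1} \mu e^{-\mu x}e^{-sx}e^{-\lambda x[1 - \beta_1(s, z)]}\, dx \\
&\quad + (1 - p)\int_{\Delta_1}^{\infty} z e^{-s\Delta_1}e^{-\lambda\Delta_1[1 - \beta_1(s, z)]}\mu e^{-\mu x}\, dx \\
&= ph_1[s + \lambda(1 - \beta_1(s, z))] \\
&\quad + (1 - p)\frac{\mu}{\mu + s + \lambda(1 - \beta_1(s, z))}\left(1 - e^{-\Delta_1(\mu + s + \lambda(1 - \beta_1(s, z)))}\right) \\
&\quad + (1 - p)z e^{-\Delta_1(\mu + s + \lambda(1 - \beta_1(s, z)))} \\
&= f_1(s + \lambda(1 - \beta_1(s, z))) + (1 - p)(z - 1)e^{-\Delta_1(\mu + s + \lambda(1 - \beta_1(s, z)))},
\end{aligned} \tag{64}$$

where $f_1$ is as in Section IV.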
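Similarly, let $X_2$ denote the length of a typical service in the RR
queue. Then

$$E[e^{-sb_2}z^K \mid X_2] = \begin{cases} e^{-sX_2}e^{-\lambda X_2(1 - \beta_1(s, z))}, & 0 < X_2 < \Delta_2, \\ z e^{-sX_2}e^{-\lambda X_2(1 - \beta_1(s, z))}, & X_2 = \Delta_2. \end{cases} \tag{65}$$

Thus

$$\begin{aligned}
\beta_2(s, z) &= E\bigl[E[e^{-sb_2}z^K \mid X_2]\bigr] \\
&= \int_0^{\Delta_2} \mu e^{-\mu x}e^{-x(s + \lambda(1 - \beta_1(s, z)))}\, dx + z e^{-\Delta_2(\mu + s + \lambda(1 - \beta_1(s, z)))} \\
&= \frac{\mu}{\mu + s + \lambda(1 - \beta_1(s, z))}\left(1 - e^{-\Delta_2(\mu + s + \lambda(1 - \beta_1(s, z)))}\right) + z e^{-\Delta_2(\mu + s + \lambda(1 - \beta_1(s, z)))} \\
&= f_2(s + \lambda(1 - \beta_1(s, z))) + (z - 1)e^{-\Delta_2(\mu + s + \lambda(1 - \beta_1(s, z)))}.
\end{aligned} \tag{66}$$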
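
The functional equation for $\beta_1$ can be solved numerically, which also gives a way to check the constants $b_1$, $b_2$, and $\Psi(0)$ used in Section IV. The sketch below solves the equation at $s = 0$ by successive substitution and compares a finite-difference estimate of $b_1$ with a closed form. The closed forms $b_1 = (1-p)e^{-\mu\Delta_1}/(1-\rho_1)$ and $b_2 = e^{-\mu\Delta_2} + \lambda b_1(1 - e^{-\mu\Delta_2})/\mu$ are not stated in the paper; they are obtained here by differentiating the final forms of the two functional equations at $s = 0$, $z = 1$, and should be read as this sketch's own derivation. The sketch further assumes that short jobs are single characters, so that $h_1(s) = e^{-s}$, and all function names are hypothetical.

```python
import math


def beta1_at_s0(z, lam, p, mu, delta1, tol=1e-14, max_iter=10_000):
    """Solve the beta_1 functional equation at s = 0 by fixed-point iteration.

    Assumption of this sketch: H_1 is a unit point mass, so h_1(u) = exp(-u).
    Starting from 0, the iteration converges to the smallest root in [0, 1].
    """
    beta = 0.0
    for _ in range(max_iter):
        sigma = lam * (1.0 - beta)              # s + lam*(1 - beta_1), with s = 0
        tail = math.exp(-delta1 * (mu + sigma))
        new = (p * math.exp(-sigma)
               + (1.0 - p) * mu / (mu + sigma) * (1.0 - tail)
               + (1.0 - p) * z * tail)
        if abs(new - beta) < tol:
            return new
        beta = new
    return beta


def b1_numeric(lam, p, mu, delta1, h=1e-6):
    """Backward-difference estimate of d(beta_1(0, z))/dz at z = 1."""
    return (beta1_at_s0(1.0, lam, p, mu, delta1)
            - beta1_at_s0(1.0 - h, lam, p, mu, delta1)) / h


def psi0_closed_form(lam, p, mu, delta1, delta2):
    """Return (b1, b2, Psi(0)) from the closed forms derived in the lead-in."""
    rho1 = lam * (p + (1.0 - p) * (1.0 - math.exp(-mu * delta1)) / mu)
    rho2 = lam * (1.0 - p) * math.exp(-mu * delta1) / mu
    if rho1 + rho2 >= 1.0:
        raise ValueError("requires rho1 + rho2 < 1")
    b1 = (1.0 - p) * math.exp(-mu * delta1) / (1.0 - rho1)
    b2 = math.exp(-mu * delta2) + lam * b1 * (1.0 - math.exp(-mu * delta2)) / mu
    return b1, b2, (1.0 - b2) / (1.0 + b1 - b2)
```

For a stable system the resulting $\Psi(0)$ lies strictly between 0 and 1, and the numerical and closed-form values of $b_1$ should agree to several digits.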


Finally, since the system is work conserving, the system-busy period
is the same as that in an ordinary FIFO system. Thus, its Laplace-
Stieltjes transform $\beta$ is given by

$$\beta(s) = h[s + \lambda(1 - \beta(s))], \tag{67}$$

where

$$h(s) = ph_1(s) + (1 - p)\frac{\mu}{\mu + s}. \tag{68}$$

In the presence of “chunk overheads,” analytic expressions for the
busy-period transforms can be obtained by making proper substitu-
tions for $h(\cdot)$, $f_1(\cdot)$, $f_2(\cdot)$, etc., in eqs. (64), (66), and (67).
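
The system-busy-period equation, $\beta(s) = h[s + \lambda(1 - \beta(s))]$, has no closed-form solution in general, but it is easy to evaluate numerically. The sketch below uses successive substitution, one common way of solving such functional equations, and checks the result against the standard M/G/1 mean busy period $E[X]/(1 - \rho)$. It assumes single-character short jobs, so that $h_1(s) = e^{-s}$ and $h(s) = p e^{-s} + (1-p)\mu/(\mu+s)$; the function names and the step size are assumptions of this sketch.

```python
import math


def busy_period_lst(s, lam, p, mu, tol=1e-14, max_iter=100_000):
    """Solve beta(s) = h(s + lam*(1 - beta(s))) by fixed-point iteration.

    Assumption of this sketch: h(u) = p*exp(-u) + (1 - p)*mu/(mu + u),
    i.e. short jobs are exactly one character long.
    """
    def h(u):
        return p * math.exp(-u) + (1.0 - p) * mu / (mu + u)

    beta = 0.0
    for _ in range(max_iter):
        new = h(s + lam * (1.0 - beta))
        if abs(new - beta) < tol:
            return new
        beta = new
    return beta


def mean_busy_period_numeric(lam, p, mu, step=1e-6):
    """Forward-difference estimate of -beta'(0), using beta(0) = 1 when rho < 1."""
    return (1.0 - busy_period_lst(step, lam, p, mu)) / step


def mean_busy_period_closed_form(lam, p, mu):
    """Standard M/G/1 result E[X] / (1 - rho), with E[X] = p + (1 - p)/mu."""
    mean_x = p + (1.0 - p) / mu
    rho = lam * mean_x
    if rho >= 1.0:
        raise ValueError("requires rho < 1")
    return mean_x / (1.0 - rho)
```

The iteration converges geometrically at a rate of roughly $\rho$, so it slows down as the load approaches one.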
Markov Models

We propose a probabilistic distance measure for measuring the dissimilarity
between pairs of hidden Markov models with arbitrary observation densities. 
The measure is based on the Kullback-Leibler number and is consistent with 
the reestimation technique for hidden Markov models. Numerical examples 
that demonstrate the utility of the proposed distance measure are given for 
hidden Markov models with discrete densities. We also discuss the effects of 
various parameter deviations in the Markov models on the resulting distance, 
and study the relationships among parameter estimates (obtained from rees- 
timation), initial guesses of parameter values, and observation duration 
through the use of the measure. 


I. INTRODUCTION


Consider two $N$-state first-order hidden Markov models specified
by the parameter sets $\lambda_i = (u^{(i)}, A^{(i)}, B^{(i)})$, $i = 1, 2$, where $u^{(i)}$ is the
initial state probability vector, $A^{(i)}$ is the state transition probability
matrix, and $B^{(i)}$ is either an $N \times M$ stochastic matrix (if the observa-
tions are discrete) or a set of $N$ continuous density functions (if the
observations are continuous).’ Our interest in this paper is to define a
distance for every such pair of hidden Markov models $(\lambda_1, \lambda_2)$ so we
can measure the dissimilarity between them. Another goal is to study
the properties of hidden Markov models, using the distance measure,
in order to understand the model sensitivities.

The need for such a distance measure arises mainly in estimation
and classification problems involving Hidden Markov Models 
(HMMs). For example, in using a reestimation algorithm to iteratively 
estimate the model parameters,” a distance measure is necessary not 
only to monitor the behavior of the reestimation procedure, but to 
indicate the expected performance of the resulting HMM. In classifi- 
cation, a good distance measure would greatly facilitate the nearest- 
neighbor search, defining Voronoi regions’ or applying the generalized 
Lloyd algorithm‘ for hidden Markov model clustering. 

The only measure for comparing pairs of HMMs that has appeared
previously in the literature is the one proposed by Levinson et al. for
discrete-observation density hidden Markov models (Ref. 1). The distance,
which is a Euclidean distance on the state-observation probability
matrices, is defined as

$$d(\lambda_1, \lambda_2) \triangleq \|B^{(1)} - B^{(2)}\| \triangleq \left\{\frac{1}{MN}\sum_{j=1}^{N}\sum_{k=1}^{M}\left[b^{(1)}_{jk} - b^{(2)}_{p(j)k}\right]^2\right\}^{1/2}, \tag{1}$$

where $B^{(i)} = [b^{(i)}_{jk}]$ is the state-observation probability matrix in model
$\lambda_i$ and $p(j)$ is the state permutation that minimizes the measure of eq.
in Ref. 1 and was used to characterize the estimation error occurring 
in the reestimation process. Minimum bipartite matching was used to 
determine the optimum state permutation for aligning the states of 
the two models. The measure of eq. (1) did not depend at all on 
estimates of u or A, since it is generally agreed that the B matrix is, 
in most cases, a more sensitive set of parameters related to the 
closeness of HMMs than the u vector or the A matrix. 
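
The Euclidean measure of eq. (1) can be sketched in a few lines. The text states that minimum bipartite matching was used to find the optimum state permutation; the sketch below substitutes a brute-force search over all permutations, which gives the same minimum but is practical only for a small number of states. The function name `euclidean_b_distance` is hypothetical.

```python
import itertools
import math


def euclidean_b_distance(b1, b2):
    """Distance of eq. (1) between two N x M state-observation matrices.

    b1, b2 : lists of N rows, each a list of M observation probabilities.
    The state permutation p(j) is found by exhaustive search (N! candidates),
    a stand-in for the bipartite matching mentioned in the text.
    """
    n, m = len(b1), len(b1[0])
    if len(b2) != n or any(len(row) != m for row in b1 + b2):
        raise ValueError("b1 and b2 must have the same N x M shape")
    best = math.inf
    for perm in itertools.permutations(range(n)):
        total = sum((b1[j][k] - b2[perm[j]][k]) ** 2
                    for j in range(n) for k in range(m))
        best = min(best, total)
    return math.sqrt(best / (m * n))
```

Two models whose observation matrices differ only by a relabeling of states are at distance zero under this measure, which is the purpose of the permutation.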

The distance measure of eq. (1) is inadequate for the following 
reasons: (1) it does not take into account the deviations in all the 
parameters of the HMM; (2) its evaluation requires a great deal of 
computation in the discrete case and probably would become intract- 
able when dealing with continuous-observation hidden Markov 
models; and (3) it is unreliable when comparing HMMs with highly 
skewed densities. Hence, our aim is to find a distance measure that 
truly measures the dissimilarity between pairs of hidden Markov 
models, can be easily evaluated, is reliable for any pair of Markov 
models, and is meaningful in the probabilistic framework of the HMM 
itself. 

In this paper, we propose such a distance measure for comparing
pairs of HMMs that follows the concept of divergence,⁵ cross entropy,
or discrimination information.⁶ The distance measure, denoted by
$D(\lambda_1, \lambda_2)$, has the form

$$\log \Pr(O_T \mid \lambda_1) - \log \Pr(O_T \mid \lambda_2),$$

where $O_T$ symbolizes an observation sequence of $T$ observations.
Because the distance measure is the difference in log probabilities of 
the observation sequence conditioned on the models being compared, 
it will sometimes be referred to as “divergence distance” or “directed 
divergence measure.” In the next section, we formally define the
distance measure, and we discuss Petrie’s results⁷ that further give the
distance measure theoretical justification. In Section III, numerical
examples related to discrete-observation hidden Markov models are
given. We show the effects of individual parameter deviations upon 
the distance measure and demonstrate several interesting properties 
of discrete-observation models that are made explicit through the use 
of the proposed distance measure. A discussion of the use of such a 
distance measure in continuous-observation models is given in Ref. 8, 
where hidden Markov models with continuous mixture densities are 
discussed. 


II. DEFINITION OF THE PROPOSED HMM DISTANCE MEASURE


In this section, we define the distance measure for any pair of
Markov models, discuss Petrie’s Limit Theorem and statistical
analysis of probabilistic functions of Markov chains,⁷ and then give
the proposed distance measure an interpretation from the
Kullback-Leibler statistic point of view. The presentation is explicit
for discrete-observation models but can easily be extended to
continuous-observation cases.

Let $\mathcal{S} = \{1, 2, \ldots, N\}$ be a state alphabet, and let
$\mathcal{Y} = \{y_1, y_2, \ldots, y_M\}$ be an observation alphabet.
The Cartesian product $\mathcal{O}_\infty = \prod_{t=1}^{\infty} \mathcal{O}_t$,
$\mathcal{O}_t = \mathcal{Y}$ for all $t$, forms an observation space in
which every point $O$ has coordinate $o_t \in \mathcal{O}_t = \mathcal{Y}$.
We are concerned with a class of stochastic processes generated by a
hidden Markov source defined by an $N \times N$ ergodic stochastic
matrix $A = [a_{ij}]$ and by an $N \times M$ stochastic matrix
$B = [b_{jk}]$. Matrix $A$, the state transition probability matrix,
generates a stationary Markov process
$S = \cdots s_{t-1} s_t s_{t+1} \cdots$ according to
$a_{ij} = \Pr\{s_{t+1} = j \mid s_t = i\}$. Based upon $S$, $B$
generates $o_t$ according to $b_{jk} = \Pr\{o_t = y_k \mid s_t = j\}$.
Let $a^* = [a_1, a_2, \ldots, a_N]$ be the stationary absolute
distribution vector for $A$, i.e., $a^* A = a^*$, where * denotes the
transpose. Then, matrices $A$ and $B$ define a measure, denoted by
$\mu(\cdot \mid \lambda)$, where $\lambda = (A, B)$, on
$\mathcal{O}_\infty$, by

$$\mu(O_T \mid \lambda) = \sum_{\text{all } S_T} a_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t I(o_t)}, \tag{2}$$

where $O_T = (o_1, o_2, \ldots, o_T)$ is the observed sequence up to
time $T$ (i.e., a truncated $O$), $S_T = (s_0, s_1, \ldots, s_T)$ is
the corresponding unobserved state sequence, and $I(\cdot)$ is the
index function

$$I(o_t) = k \quad \text{if } o_t = y_k.$$


Let $\Lambda_1$ be the space of $N \times N$ ergodic stochastic
matrices, $\Lambda_2$ be the space of $N \times M$ stochastic matrices,
and $\Lambda = \Lambda_1 \times \Lambda_2$. Clearly, $\lambda \in \Lambda$,
and for every point in $\Lambda$ there is a stationary measure
$\mu(\cdot \mid \lambda)$ associated with it.
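The sum over all state sequences in eq. (2) has $N^{T+1}$ terms, but it can be accumulated one observation at a time. The Python sketch below is illustrative only: `log_mu` uses the standard scaled forward recursion (a technique not spelled out in this section), and `mu_brute_force` evaluates eq. (2) literally so that the two can be compared on a short sequence. The two-state model is a hypothetical toy example, not one from the text; its vector `a` satisfies $a^* A = a^*$.

```python
from itertools import product
from math import exp, log

def log_mu(obs, a, A, B):
    """log mu(O_T | lambda) of eq. (2) by the scaled forward recursion.
    obs: observation indices I(o_t) in 0..M-1; a: distribution of s_0;
    A: N x N transition matrix; B: N x M observation matrix.
    Assumes the sequence has nonzero probability under the model."""
    n = len(A)
    alpha, total = list(a), 0.0
    for o in obs:
        # One step of eq. (2): move s_{t-1} -> s_t, then emit o_t from s_t.
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
        scale = sum(alpha)
        total += log(scale)              # accumulate the log probability
        alpha = [x / scale for x in alpha]
    return total

def mu_brute_force(obs, a, A, B):
    """Literal evaluation of eq. (2): sum over every state sequence S_T."""
    n, T = len(A), len(obs)
    total = 0.0
    for s in product(range(n), repeat=T + 1):
        p = a[s[0]]
        for t in range(1, T + 1):
            p *= A[s[t - 1]][s[t]] * B[s[t]][obs[t - 1]]
        total += p
    return total

# Hypothetical two-state model with stationary vector a (a A = a).
A_toy = [[0.9, 0.1], [0.2, 0.8]]
B_toy = [[0.7, 0.3], [0.4, 0.6]]
a_toy = [2.0 / 3.0, 1.0 / 3.0]
obs_toy = [0, 1, 1, 0]
print(exp(log_mu(obs_toy, a_toy, A_toy, B_toy)),
      mu_brute_force(obs_toy, a_toy, A_toy, B_toy))
```

Working with the logarithm of the running scale factors avoids numerical underflow for long sequences, which matters because the quantities of interest below are normalized log probabilities.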

Now consider a probability space
$(\mathcal{O}_\infty, \mu(\cdot \mid \lambda_0))$, which will be
abbreviated as $(\mathcal{O}_\infty, \lambda_0)$ in the following
without ambiguity. Let an observation sequence $O_T$ be generated
according to the distribution $\mu(\cdot \mid \lambda_0)$.

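For each $T$ and each $O \in \mathcal{O}_\infty$, define the function
$H_T(O, \lambda)$ on $\Lambda$ by

$$H_T(O, \lambda) = \frac{1}{T} \log \mu(O_T \mid \lambda). \tag{3}$$

Each $H_T(\cdot, \lambda)$ is thus a random variable on the probability
space $(\mathcal{O}_\infty, \lambda_0)$. Also, for a given fixed
observation $O_T$, $H_T(O, \lambda)$ is a function on $\Lambda$.
Petrie⁷ proved (limit theorem) that for each $\lambda$ in $\Lambda$,

$$\lim_{T \to \infty} H_T(O, \lambda) = \lim_{T \to \infty} \frac{1}{T} \log \mu(O_T \mid \lambda) = H(\lambda_0, \lambda) \tag{4}$$

exists almost everywhere $\mu(\cdot \mid \lambda_0)$. Furthermore,

$$H(\lambda_0, \lambda_0) \ge H(\lambda_0, \lambda), \tag{5}$$

with equality if and only if
$\lambda \in G(\lambda_0) = \{\lambda \in \Lambda \mid \mu(\cdot \mid \lambda) = \mu(\cdot \mid \lambda_0)$
as measures on $\mathcal{O}_\infty\}$. Define
$\Lambda_T(O) = \{\lambda' \in \Lambda \mid H_T(O, \lambda)$ is maximized at
$\lambda'\}$. Then, $\Lambda_T(O) \to G(\lambda_0)$ almost everywhere
$\mu(\cdot \mid \lambda_0)$ (see Ref. 7,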
Theorem 2.8). The results give further justification to the well-known
reestimation procedure⁹ for Markov modeling.

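With the above background, we define a distance measure
$D(\lambda_0, \lambda)$ between two Markov sources $\lambda_0$ and
$\lambda$ by

$$D(\lambda_0, \lambda) = H(\lambda_0, \lambda_0) - H(\lambda_0, \lambda) = \lim_{T \to \infty} \frac{1}{T} \left[ \log \mu(O_T \mid \lambda_0) - \log \mu(O_T \mid \lambda) \right]. \tag{6}$$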


The aforementioned limit theorem guarantees the existence of such a
distance measure and eq. (5) ensures that $D(\lambda_0, \lambda)$ is
nonnegative. $D(\lambda_0, \lambda) = 0$ if and only if
$\lambda \in G(\lambda_0)$, a point that is indistinguishable by the
associated probability measure.

By invoking ergodicity,¹⁰ we see that the distance is in fact the
Kullback-Leibler number⁵ between measures $\mu(\cdot \mid \lambda_0)$
and $\mu(\cdot \mid \lambda)$. If $\mathcal{H}_0$ and $\mathcal{H}_1$
are the hypotheses that $O_T$ is from the statistical population with
measure $\mu(\cdot \mid \lambda_0)$ and $\mu(\cdot \mid \lambda)$,
respectively, $D(\lambda_0, \lambda)$ is then the average information
per observation sample in $O_T$ for discrimination in favor of
$\mathcal{H}_0$ against $\mathcal{H}_1$. Since $O_T$ is generated
according to $\mu(\cdot \mid \lambda_0)$,
$\lim_{T \to \infty} (1/T) \log \mu(O_T \mid \lambda_0)$ should be a
maximum over $\Lambda$, and $D(\lambda_0, \lambda)$ is a measure of
directed divergence, from $\lambda_0$ to $\lambda$, manifested by the
observation $O_T$.

The distance measure of eq. (6) is clearly nonsymmetric. A natural
extension of this measure is the symmetrized version of eq. (6), i.e.,

$$D_s(\lambda_0, \lambda) = \frac{1}{2} \left[ D(\lambda_0, \lambda) + D(\lambda, \lambda_0) \right], \tag{7}$$

which is the average of the two nonsymmetric distances.
$D_s(\lambda_0, \lambda)$ is symmetric with respect to $\lambda_0$ and
$\lambda$ and represents a measure of the difficulty (or ease) of
discriminating between $\mu(\cdot \mid \lambda_0)$ and
$\mu(\cdot \mid \lambda)$, or equivalently, $\lambda_0$ and $\lambda$.
For our purpose, however, there is no particular requirement that the
distance be symmetric, and our study will mainly concentrate on the
definition of eq. (6).


III. DISCRETE-OBSERVATION HIDDEN MARKOV MODELS


Using the distance measure of eq. (6), we have studied the behavior 
of several discrete-observation hidden Markov models. In this section, 
we present some results on the sensitivities of the reestimation pro- 
cedure to observation sequence length, initial parameter estimates, 
etc. We begin with a discussion of the evaluation of such a distance 
measure. 


3.1 Evaluation of the distance measure 


Evaluation of the distance of eq. (6) is rather straightforward. A
standard Monte Carlo simulation procedure based upon a good random
number generator is used to generate the required observation
sequence $O_T$ according to the given distribution
$\mu(\cdot \mid \lambda_0)$. The probabilities of observing the
generated sequence from models $\lambda_0$ and $\lambda$ are then
calculated. By way of example, Fig. 1a shows the logarithm of
$\mu(O_T \mid \lambda_0)$ and $\mu(O_T \mid \lambda)$, respectively, as
a function of the observation duration $T$. The resulting distance
$D(\lambda_0, \lambda)$ is then plotted in Fig. 1b. For this example,
$\lambda_0 = (A_0, B_0)$, $\lambda = (A, B)$, $N = M = 4$, where


$$A_0 = \begin{bmatrix} 0.8 & 0.15 & 0.05 & 0 \\ 0.07 & 0.75 & 0.12 & 0.06 \\ 0.05 & 0.14 & 0.8 & 0.01 \\ 0.001 & 0.089 & 0.11 & 0.8 \end{bmatrix}, \quad B_0 = \begin{bmatrix} 0.3 & 0.4 & 0.2 & 0.1 \\ 0.5 & 0.3 & 0.1 & 0.1 \\ 0.1 & 0.2 & 0.4 & 0.3 \\ 0.4 & 0.3 & 0.1 & 0.2 \end{bmatrix},$$

and

$$A = \begin{bmatrix} 0.4 & 0.25 & 0.15 & 0.2 \\ 0.27 & 0.45 & 0.22 & 0.06 \\ 0.35 & 0.14 & 0.4 & 0.11 \\ 0.111 & 0.119 & 0.23 & 0.54 \end{bmatrix}, \quad B = \begin{bmatrix} 0.1 & 0.15 & 0.65 & 0.1 \\ 0.2 & 0.3 & 0.4 & 0.1 \\ 0.3 & 0.3 & 0.1 & 0.3 \\ 0.15 & 0.25 & 0.4 & 0.2 \end{bmatrix}.$$

Fig. 1—(a) Log probabilities $\mu(O_T \mid \lambda_0)$ and
$\mu(O_T \mid \lambda)$ versus the number of observations for a pair of
models that are close in distance. (b) Distance $D(\lambda_0, \lambda)$
versus the number of observations for the same pair of models. (Axes:
log probability in (a) and distance in (b), each versus the number of
observations, $T$.)


We can see from Fig. 1 that, for this example, it takes around 150
observation samples to converge to a distance of 0.14 (to within
statistical fluctuations). It is readily shown that the number of
observations needed for convergence of the distance to a fixed value is
dependent on $N$ and $M$.

Although the definition of the distance of eq. (6) requires that the
pair of models being compared both be ergodic and that there exist a
stationary absolute distribution vector $a$ such that $a^* A = a^*$,
practical evaluation of the distance can still be performed for other
types of Markov models. We often define the distance measure by
replacing the stationary equilibrium distribution vector with the
initial state probability vector. In the case of left-to-right models,¹
we use a series of restarted sequences as the generated sequence for
distance evaluation, because of the trap state in left-to-right models.
In fact, except for some possible minor theoretical discrepancies
(which might be traced back to the problem of nonergodic model
estimation), the proposed distance measure appears to work quite
reliably for any pair of such HMMs. Particularly, in the previous
example, the initial state probability vectors associated with models
$\lambda_0$ and $\lambda$ were $u_0^* = [0.75\ 0.15\ 0.05\ 0.05]$ and
$u^* = [0.4\ 0.25\ 0.15\ 0.2]$, respectively.
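The evaluation procedure of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code behind Fig. 1: the initial vector is taken as the distribution of $s_0$ in eq. (2), the likelihoods are computed with the scaled forward recursion, and the helper names (`draw`, `generate`, `log_prob`, `distance`), the seed, and the choice $T = 2000$ are all illustrative. The matrices and initial vectors are those of the Fig. 1 example.

```python
import random
from math import log

# Example models from the text: lambda_0 = (u0, A0, B0), lambda = (u, A, B).
u0 = [0.75, 0.15, 0.05, 0.05]
A0 = [[0.8, 0.15, 0.05, 0.0], [0.07, 0.75, 0.12, 0.06],
      [0.05, 0.14, 0.8, 0.01], [0.001, 0.089, 0.11, 0.8]]
B0 = [[0.3, 0.4, 0.2, 0.1], [0.5, 0.3, 0.1, 0.1],
      [0.1, 0.2, 0.4, 0.3], [0.4, 0.3, 0.1, 0.2]]
u = [0.4, 0.25, 0.15, 0.2]
A = [[0.4, 0.25, 0.15, 0.2], [0.27, 0.45, 0.22, 0.06],
     [0.35, 0.14, 0.4, 0.11], [0.111, 0.119, 0.23, 0.54]]
B = [[0.1, 0.15, 0.65, 0.1], [0.2, 0.3, 0.4, 0.1],
     [0.3, 0.3, 0.1, 0.3], [0.15, 0.25, 0.4, 0.2]]

def draw(probs, rng):
    """Sample an index according to a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate(model, T, rng):
    """Generate T observation indices; the start state s_0 is drawn from u."""
    u_vec, A_mat, B_mat = model
    s = draw(u_vec, rng)
    obs = []
    for _ in range(T):
        s = draw(A_mat[s], rng)           # transition s_{t-1} -> s_t
        obs.append(draw(B_mat[s], rng))   # emit o_t from state s_t
    return obs

def log_prob(obs, model):
    """log mu(O_T | lambda) by the scaled forward recursion."""
    u_vec, A_mat, B_mat = model
    n = len(A_mat)
    alpha, total = list(u_vec), 0.0
    for o in obs:
        alpha = [sum(alpha[i] * A_mat[i][j] for i in range(n)) * B_mat[j][o]
                 for j in range(n)]
        scale = sum(alpha)
        total += log(scale)
        alpha = [x / scale for x in alpha]
    return total

def distance(model0, model, T, seed=0):
    """Finite-T estimate of D(lambda_0, lambda) in eq. (6)."""
    obs = generate(model0, T, random.Random(seed))
    return (log_prob(obs, model0) - log_prob(obs, model)) / T

lam0, lam = (u0, A0, B0), (u, A, B)
d_forward = distance(lam0, lam, T=2000)
d_reverse = distance(lam, lam0, T=2000)
d_sym = 0.5 * (d_forward + d_reverse)     # symmetrized form of eq. (7)
print(d_forward, d_reverse, d_sym)
```

Because each estimate is computed from a single generated sequence, repeated runs with different seeds give a rough picture of the statistical fluctuation at a given $T$.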


3.2 Effects of parameter deviations on the distance 


We are interested in studying the relationship between parameter
deviation and model distance, as well as the relative sensitivity of the
distance to different parameter sets that define the HMMs. To
illustrate such parameter sensitivities, we have studied HMMs whose
parameters are related to the matrices $W_1$ and $W_2$,

$$W_1 = \begin{bmatrix} 0.3 & 0.4 & 0.2 & 0.1 \\ 0.5 & 0.3 & 0.1 & 0.1 \\ 0.1 & 0.2 & 0.4 & 0.3 \\ 0.4 & 0.3 & 0.1 & 0.2 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 0.05 & 0.1 & 0.65 & 0.25 \\ 0.1 & 0.05 & 0.75 & 0.1 \\ 0.45 & 0.45 & 0.05 & 0.05 \\ 0.05 & 0.1 & 0.65 & 0.2 \end{bmatrix},$$

and to the vector $v^* = [0.75\ 0.15\ 0.05\ 0.05]$. In particular, model
$\lambda_0$ is defined by $(u_0, A_0, B_0)$, where $u_0 = v$ and
$A_0 = B_0 = W_1$. We chose $A_0 = B_0$ to avoid a priori numerical
difference in different parameter sets. The alternate model,
$\lambda = (u, A, B)$, is varied from $\lambda_0$ by modifying, in turn,
either $A$ or $B$.

We first study the effect of changes in only the state transition
probability matrix on the computed distance. We form a sequence of
models $\lambda = (u, A, B)$, where $u = u_0 = v$, $B = B_0 = W_1$ and

$$A = \left( \frac{1}{1+\delta} \right) W_1 + \left( \frac{\delta}{1+\delta} \right) W_2, \tag{8}$$


with $\delta$ varying from 0.001 to 0.991 in 99 equal steps. For each
pair $(\lambda_0, \lambda)$, $D(\lambda_0, \lambda)$ is then evaluated.
The bottom curve in Fig. 2a shows a plot of $D(\lambda_0, \lambda)$ as a
function of the deviation factor $\delta$. Furthermore, for potential
geometric interpretations, we calculate the signal-to-noise ratio
$\gamma_A$, defined by

$$\gamma_A = 10 \log_{10} \frac{\|A_0\|^2}{\|A_0 - A\|^2}, \tag{9}$$

where $\|\cdot\|$ denotes matrix norm
($\|A\|^2 = \sum_i \sum_j a_{ij}^2$ for $A = [a_{ij}]$). For small
$\delta$, $A$ is very close to $A_0$ and $\gamma_A$ is large.
Accordingly, the distance $D(\lambda_0, \lambda)$ as a function of
$\gamma_A$ is plotted in Fig. 2b. (Note that small values of $\delta$ in
Fig. 2a correspond to large values of $\gamma_A$ in Fig. 2b; i.e., the
direction of the curves is reversed.)

Similarly, we study the effect of changes in only the observation
probability matrix $B$. The sequence of models $\lambda = (u, A, B)$ for
comparison is formed by setting $u = u_0 = v$, $A = A_0 = W_1$ and

$$B = \left( \frac{1}{1+\delta} \right) W_1 + \left( \frac{\delta}{1+\delta} \right) W_2, \tag{10}$$

again, with $\delta$ varying from 0.001 to 0.991. The relationships,
$D(\lambda_0, \lambda)$ versus $\delta$ and $D(\lambda_0, \lambda)$
versus $\gamma_B$,

$$\gamma_B = 10 \log_{10} \frac{\|B_0\|^2}{\|B_0 - B\|^2}, \tag{11}$$

are shown as the upper curves in Figs. 2a and b, respectively.

Fig. 2—(a) Relationship between the probabilistic distance and the
model deviation factor $\delta$ of the $A$ and $B$ parameters for a pair
of HMMs. (b) Relationship between the probability distance and measured
parameter deviations of the $A$ and $B$ parameters for a pair of HMMs.
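The deviation factor and the signal-to-noise ratio can be related numerically with a short Python sketch. It assumes the mixing rule $(W_1 + \delta W_2)/(1 + \delta)$ for eqs. (8) and (10) and the squared-sum matrix norm of eq. (9); the function names and the 2 x 2 matrices are hypothetical, chosen only so that the arithmetic can be followed by hand.

```python
from math import log10

def mix(W1, W2, delta):
    """Deviated matrix (W1 + delta * W2) / (1 + delta), the form assumed
    here for eqs. (8) and (10). Rows stay stochastic if both inputs are."""
    return [[(x + delta * y) / (1.0 + delta) for x, y in zip(r1, r2)]
            for r1, r2 in zip(W1, W2)]

def sq_norm(X):
    """Squared matrix norm: the sum of squared entries."""
    return sum(v * v for row in X for v in row)

def snr_db(X0, X):
    """Signal-to-noise ratio of eqs. (9) and (11), in decibels."""
    diff = [[p - q for p, q in zip(r0, r)] for r0, r in zip(X0, X)]
    return 10.0 * log10(sq_norm(X0) / sq_norm(diff))

# Hypothetical 2 x 2 stochastic matrices, for illustration only.
W1_toy = [[1.0, 0.0], [0.0, 1.0]]
W2_toy = [[0.0, 1.0], [1.0, 0.0]]

A_half = mix(W1_toy, W2_toy, 1.0)    # every entry becomes 0.5
print(A_half, snr_db(W1_toy, A_half))
for delta in (0.001, 0.1, 0.991):
    print(delta, snr_db(W1_toy, mix(W1_toy, W2_toy, delta)))
```

Under this mixing rule the difference from the reference matrix is $\delta (W_1 - W_2)/(1 + \delta)$, so the ratio falls monotonically as $\delta$ grows, which is why the curves of Fig. 2b run in the opposite direction to those of Fig. 2a.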

Both curves of Fig. 2b show a simple monotonic exponential trend
for the example studied. This exponential trend may be intuitively
anticipated from eq. (2), which shows that $\mu$ is in the form of a
product. This monotonic relationship is, in general, true when the
signal-to-noise ratio is adequately high, i.e., models are close enough
in the Euclidean distance sense. This result is consistent with Theorem
3.19 in Ref. 7, which gives the set $G(\lambda_0)$ a geometric
interpretation. For more complicated models or other types of deviations
than those of eqs. (8) and (10), however, the simple monotonic
exponential relationship of the type shown in Fig. 2b may not be
observed in low signal-to-noise ratio regions.

Another important property of the distance measure, as seen in Fig. 
2, is that deviations in the observation probability matrix B give, in 
general, larger distance scores than similar deviations in the state- 
transition matrix A. Thus, the B matrix appears to be numerically 
more important than the A matrix in specifying a hidden Markov 
model. It is our opinion that this may be a desirable inherent property 
of hidden Markov models for speech recognition applications. 


3.3 Examples of the use of the distance in model estimation 
3.3.1 Ergodic models 
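Consider the following models:

1. $\lambda_a$: $N = M = 4$, balanced model

$$A = \begin{bmatrix} 0 & 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0 & 0.5 \\ 0.5 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0.5 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0 & 0.5 \end{bmatrix}, \quad u = \begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix}.$$

2. $\lambda_b$: $N = M = 4$, skewed model

$$A = \begin{bmatrix} 0 & 0 & 0.25 & 0.75 \\ 0.15 & 0 & 0 & 0.85 \\ 0.2 & 0.8 & 0 & 0 \\ 0 & 0.22 & 0.78 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0.25 & 0.75 & 0 & 0 \\ 0 & 0.15 & 0.85 & 0 \\ 0 & 0 & 0.1 & 0.9 \\ 0.2 & 0 & 0 & 0.8 \end{bmatrix}, \quad u = \begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix}.$$

3. $\lambda_c$: $N = M = 5$, deterministic observation

$$A = \begin{bmatrix} 0 & 0.8 & 0.1 & 0.1 & 0 \\ 0 & 0.5 & 0.5 & 0 & 0 \\ 0.4 & 0 & 0.2 & 0 & 0.4 \\ 0.3 & 0.2 & 0.1 & 0.4 & 0 \\ 0.2 & 0.1 & 0.2 & 0.4 & 0.1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad u = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$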
It should be pointed out that $\lambda_a$ is a balanced model in which
transitions as well as observations are equiprobable within the
structural constraints, while $\lambda_b$ is a skewed model with the
same Markov chain structure as $\lambda_a$. Model $\lambda_c$ has a
unique observation probability matrix, namely the identity matrix, which
links observations to distinct model states.

A number of observation sequences, $O_T$, of different duration were
generated from these models. Then, for each $O_T$ sequence, a model
estimate, generically denoted as $\lambda_a'$, $\lambda_b'$, or
$\lambda_c'$, was obtained using the reestimation algorithm, which,
starting from an arbitrary guess, iterated until a certain convergence
criterion was met.¹

Each sequence $O_T$ of duration $T$ thus corresponds to a model
estimate for which the divergence distance can be evaluated from the
generating model. Figures 3a, b, and c are plots of
$D(\lambda_a, \lambda_a')$, $D(\lambda_b, \lambda_b')$, and
$D(\lambda_c, \lambda_c')$, respectively, as a function of the duration
$T$. These figures display typical simulation results of the statistical
reestimation technique. Important considerations behind the simulation
process include: (1) characteristics of the generating source, such as
$\lambda$ being a balanced or skewed model; (2) effectiveness of the
estimation technique; and (3) the number of observations needed for a
good estimate.
Here we provide qualitative discussions of the plotted results.

Fig. 3—Distance performance of reestimated ergodic models as a function
of the observation duration: (a) $\lambda_a$, balanced model; (b)
$\lambda_b$, skewed model; (c) $\lambda_c$, model with deterministic
observation.

Figure 3a indicates that the distance between $\lambda_a$ and
$\lambda_a'$ stabilizes after $T$ grows beyond about five hundred
samples. The distance for $T > 500$ is small (about 0.085), with a range
of statistical variation of about ±0.025. The distance scores of Fig. 3b
do not seem to be as well behaved as those of Fig. 3a. Although the
estimate $\lambda_b'$ for $\lambda_b$ may be as good as $\lambda_a'$,
judging from the distance, $\lambda_b'$ appears to be more data
dependent. A slow drifting from $D = 0.04$ in the $T = 1000 \sim 2000$
region to $D = 0.07$ in the $T = 3000 \sim 4000$ region is seen. This
can be attributed to the fact that $\lambda_b$ is a skewed model and the
associated measure $\mu(\cdot \mid \lambda_b)$ has a slightly wider
dynamic range than $\mu(\cdot \mid \lambda_a)$; hence deviations in $D$
are manifested over a broad range of values of $T$. Those generated
sequences, $O_T$, of high $\mu(O_T \mid \lambda_b)$ will result in a
close estimate $\lambda_b'$, and the wide dynamic range in
$\mu(\cdot \mid \lambda_b)$ will directly translate into the observed
variations in $D(\lambda_b, \lambda_b')$ for long observation sequences.
This long-term drifting of $D$ is reminiscent of the residual difference
between uncorrelated and highly correlated sources in statistical data
analysis.

The results of Fig. 3c indicate that when the generating source
involves only a Markov chain and does not have variations in the
observation density, very good estimates can be obtained with a small
amount of data. Also, the $B$ matrix, because it is an identity matrix,
greatly narrows the range of $\mu(\cdot \mid \lambda_c)$, resulting in
negligible variations in $D(\lambda_c, \lambda_c')$ when $O_T$ is
sufficiently long.


3.3.2 Left-to-right models 


Another series of simulations dealt with nonergodic models of the
types shown in Fig. 4. These models are identical to the three models
SRC195, SRC295, and SRC395 studied in Ref. 1. We denote these
models by $\lambda_{195}$, $\lambda_{295}$, and $\lambda_{395}$ as in
Fig. 4. For these models, $N = 5$, $M = 9$, $u^* = [1\ 0\ 0\ 0\ 0]$, and

$$B = \begin{bmatrix} 0.7 & 0.3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.8 & 0.2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0.2 & 0.8 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.3 & 0.7 \end{bmatrix}.$$


An additional model, $\lambda_{595}$, which had the same state
transition probability and initial state probability as $\lambda_{295}$
but with the following $B$ matrix


0.8 01 01 0 0 O 
0 O01 08 01 0 £O 
0 O 01 08 01 O 
0 0 O O01 08 O.1 
0 0 0 O O01 08 01 


ooo & 


was also studied. 

The observation matrix $B$ for $\lambda_{195}$ through $\lambda_{395}$
is non-overlapping; observations generated during one state cannot
appear during another state. As was the case for model $\lambda_c$ in
the previous section, these models show a rigid correspondence between
states and observations. In this case, we can observe some particular
effects of the coupling between matrices $A$ and $B$ upon model
estimates as well as the distance.

Fig. 4—Left-to-right hidden Markov models: (a) $\lambda_{195}$, (b)
$\lambda_{295}$ and $\lambda_{595}$, and (c) $\lambda_{395}$, used in
the study of model parameter estimation sensitivity.

The same simulation procedure as above was followed: (1) observation
sequences were first generated; (2) model estimates were then
obtained using the reestimation algorithm with different initial
guesses; and (3) distances between the generating model and the
estimated model were calculated and plotted as a function of $T$, the
total sequence duration. (Note that because of the trap state in these
models, the measurement sequence is a series of restarted sequences
and $T$ is the total duration.) Four kinds of initial guesses of model
parameters were used. Type 1 is a totally random guess (except for
the necessary stochastic constraints). Type 2 is a random guess with
known state-transition constraints; that is, elements in $A$
corresponding to prohibited transitions are initially set to zero, while
others are randomly chosen with stochastic constraints. Type 3 is a
random guess with both known state-transition constraints and known
state-observation constraints, so in the initial matrices $A$ and $B$,
those elements corresponding to prohibited transitions and impossible
observations are set to zero. Type 4 is the generating model itself.
Type 4 is useful for studying the convergence properties of the
reestimation algorithm itself, since the sequence is unlikely to display
complications often observed in sequences that converge to a local
optimum.

Fig. 5—Computed model distance versus number of observations, for the
four types of parameter initial guesses, for model $\lambda_{195}$.
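The first three kinds of initial guess can be generated mechanically from zero/one structure masks, as in the minimal Python sketch below. Everything about the sampling is an assumption made for illustration: the text does not say how the random entries were drawn, so the sketch draws uniform weights and normalizes each row. The observation mask is read off the non-overlapping $B$ matrix given above; the transition mask (self-loop or move to the next state) is hypothetical, since the actual structures appear only in Fig. 4.

```python
import random

def random_stochastic(mask, rng):
    """Random row-stochastic matrix; entries with mask 0 are held at zero."""
    rows = []
    for mask_row in mask:
        weights = [rng.uniform(0.05, 1.0) if m else 0.0 for m in mask_row]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def initial_guess(kind, a_mask, b_mask, rng):
    """Return (A, B) for initial-guess Types 1, 2, or 3 described in the text."""
    n, m = len(b_mask), len(b_mask[0])
    free_a = [[1] * n for _ in range(n)]
    free_b = [[1] * m for _ in range(n)]
    if kind == 1:      # totally random, stochastic constraints only
        return random_stochastic(free_a, rng), random_stochastic(free_b, rng)
    if kind == 2:      # known transition structure, unconstrained B
        return random_stochastic(a_mask, rng), random_stochastic(free_b, rng)
    if kind == 3:      # known transition and observation structure
        return random_stochastic(a_mask, rng), random_stochastic(b_mask, rng)
    raise ValueError("Type 4 is the generating model itself; pass it directly")

# Hypothetical left-to-right transition structure (self-loop or next state).
A_MASK = [[1, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 1, 1, 0],
          [0, 0, 0, 1, 1], [0, 0, 0, 0, 1]]
# Observation structure read off the non-overlapping B matrix in the text.
B_MASK = [[1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 1, 1]]

rng = random.Random(1)
guesses = {kind: initial_guess(kind, A_MASK, B_MASK, rng) for kind in (1, 2, 3)}
```

Entries that start at exactly zero remain zero under the usual reestimation updates, which is why the Type 2 and Type 3 constraints persist through the iterations.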

Curves showing the measured distances versus the number of
observations, for the four types of initial guess, are plotted in Figs.
5, 6, 7, and 8, corresponding to $\lambda_{195}$, $\lambda_{295}$,
$\lambda_{395}$, and $\lambda_{595}$, respectively. Figure 5 shows that
for model $\lambda_{195}$, the model estimates with Type 1, 3, and 4
initial guesses quickly converge to the generating model
$\lambda_{195}$, i.e., the distances became essentially 0. Note that
$\lambda_{195}$ has a highly constrained structure with high probability
of staying in the current state. This, combined with the fact that $B$
is non-overlapping, says that this source would most probably produce
observation sequences in which the corresponding state sequence is well
defined, and the durations of the states (as determined from the $A$
matrix) are unlikely to differ from one another dramatically. Type 2
initial guesses maintain the same Markov chain structure, but with
random transition probabilities different from the generating source. In
fitting the observation sequence to a Type 2 initial estimate, the
constrained $A$ matrix is changed very little in the reestimation
process, which instead mainly extends and modifies the $B$ matrix to
include, in one state, different observations that originally occurred
in different states. Depending on the initial guess values, the
resultant $B$ matrix may be significantly overlapped, thereby leading to
a significant distance from the generating source. Figure 5 shows that
this analysis is indeed the case for model $\lambda_{195}$. With Type 3
initial guesses, the initial constraints in matrices $A$ and $B$ are
retained through the reestimation process, and optimization of the $A$
matrix is independent of that of the $B$ matrix. Furthermore,
optimization of the $B$ matrix, in the current case, is carried out
independently for each state, because with the initial constrained $B$
matrix, the underlying state sequence is immediately known. Therefore,
the results from using Type 4 initial guesses are virtually identical to
the results from using Type 3 initial guesses, as shown in Figs. 5
through 8, for all models that were studied.

Fig. 6—Computed model distance versus number of observations, for the
four types of parameter initial guesses, for model $\lambda_{295}$.

For $\lambda_{295}$, trends similar to those of Fig. 5 are observed and
shown in Fig. 6, but some problems due to the allowed state-skipping
transitions are observed for Type 1 initial parameter estimates. As
explained above, estimated models based upon Type 2 initial guesses are
at a significant distance from the generating source. With Type 1
initial guesses, the converged results are only slightly better than
those with Type 2 guesses. Type 3 and 4 initial guesses lead to
virtually the optimal estimate, i.e., models with essentially zero
distance from the generator.

Fig. 7—Computed model distance versus number of observations, for the
four types of parameter initial guesses, for model $\lambda_{395}$.

The results using model $\lambda_{395}$ are shown in Fig. 7. Model
$\lambda_{395}$ has a state-transition structure similar to that of
$\lambda_{295}$, but with different transition probabilities. The fact
that $a_{22} = 0.2$ and $a_{44} = 0.1$ in $\lambda_{395}$ makes it
essentially a three-state model (i.e., two of the states are highly
transient). Again, the dependence of model estimates upon the type of
initial guess is similar to what is mentioned above, except the
distances now are smaller than those obtained using $\lambda_{295}$.
However, as seen in Fig. 7, when the total duration is small (i.e., 244
samples), the estimated models are at a significant distance from the
generating source, regardless of the initial guess. This is because
states 2 and 4 are not well represented in the observation sequences.
The sudden drop of distance for Type 3 and 4 initial guesses at
$T \approx 1150$ samples indicates that the transient states of the
generating model are sufficiently well represented for $T > 1150$, and
with proper initial guesses, an estimate virtually identical to the
generating source can be obtained.

Fig. 8—Computed model distance versus number of observations, for the
four types of parameter initial guesses, for model $\lambda_{595}$.

The results for model $\lambda_{595}$ are given in Fig. 8. Since source
$\lambda_{595}$ has an overlapping observation matrix $B$, many of the
phenomena that occurred in $\lambda_{195}$, $\lambda_{295}$, and
$\lambda_{395}$ no longer appear. Indeed, as shown in Fig. 8, estimated
models of nearly zero distance from the generating source have been
obtained regardless of the initial guess, provided the observation
sequences are sufficient in duration. The effects of initial guess are
manifested only in the way the estimate converges as $T$ grows.

Figures 5, 6, 7, and 8 not only provide results pertaining to the
performance of model estimates and their relationship to model
initialization as well as observation length, but also show the
effectiveness of the distance measure of eq. (6) in measuring the
dissimilarity between any pair of hidden Markov models.


IV. CONCLUSION 


We have defined a probabilistic distance measure for hidden Markov
models. The measure is consistent with the probabilistic modeling
technique and can be efficiently evaluated through Monte Carlo
procedures. The distance measure was employed in the study of relative
parameter sensitivities as well as the relationship among model
estimate, initial guess for the reestimation algorithm, and the
observation sequence duration for discrete density hidden Markov models.
Much of the behavior of hidden Markov models and the reestimated results
has been observed through the use of such a distance measure. The
study in turn confirms the effectiveness and reliability of the distance
measure. Potential applications of the distance measure may include
hidden Markov model selection as well as clustering.


REFERENCES 


1. S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, “An Introduction to the
Application of the Theory of Probabilistic Functions of a Markov Process to
Automatic Speech Recognition,” B.S.T.J., 62, No. 4 (April 1983), pp. 1035-74.

2. L. E. Baum et al., “A Maximization Technique Occurring in the Statistical Analysis
of Probabilistic Functions of Markov Chains,” Ann. Math. Statist., 41 (1970), pp.
164-71.

3. J. H. Conway and N. J. A. Sloane, “Voronoi Regions of Lattices, Second Moments
of Polytopes, and Quantization,” IEEE Trans. Inform. Theory, IT-28 (March
1982), pp. 227-32.

4. Y. Linde, A. Buzo, and R. M. Gray, “An Algorithm for Vector Quantizer Design,”
IEEE Trans. Commun., COM-28 (January 1980), pp. 84-95.

5. S. Kullback, Information Theory and Statistics, New York: Wiley, 1958.

6. R. M. Gray et al., “Rate-Distortion Speech Coding With a Minimum Discrimination
Information Distortion Measure,” IEEE Trans. Inform. Theory, IT-27, No. 6
(November 1981), pp. 708-21.

7. T. Petrie, “Probabilistic Functions of Finite State Markov Chains,” Ann. Math.
Statist., 40, No. 1 (1969), pp. 97-115.

8. L. R. Rabiner et al., unpublished work.

9. L. E. Baum and J. A. Eagon, “An Inequality With Applications to Statistical
Estimation for Probabilistic Functions of a Markov Process and to a Model for
Ecology,” Bull. AMS, 73 (1967), pp. 360-3.

10. P. Billingsley, Ergodic Theory and Information, New York: Wiley, 1965.


AUTHORS 


Biing-Hwang Juang, B.Sc. (Electrical Engineering), 1973, National Taiwan 
University, Republic of China; M.Sc. and Ph.D. (Electrical and Computer 
Engineering), University of California, Santa Barbara, 1979 and 1981, respec- 
tively; Speech Communications Research Laboratory (SCRL), 1978; Signal 
Technology, Inc., 1979-1982; AT&T Bell Laboratories, 1982; AT&T Infor- 
mation Systems Laboratories, 1983; AT&T Bell Laboratories, 1983—. Before 
joining AT&T Bell Laboratories, Mr. Juang worked on vocal tract modeling 
at Speech Communications Research Laboratory, and on speech coding and 
interference suppression at Signal Technology, Inc. Presently, he is a member 
of the Acoustics Research Department, where he is researching speech com- 
munications techniques and stochastic modeling of speech signals. 


Lawrence R. Rabiner, S.B. and S.M., 1964, Ph.D., 1967 (Electrical Engi- 
neering), The Massachusetts Institute of Technology; AT&T Bell Laborato- 
ries, 1962—. Presently, Mr. Rabiner is engaged in research on speech com- 
munications and digital signal processing techniques. He is coauthor of Theory 
and Application of Digital Signal Processing (Prentice-Hall, 1975), Digital 
Processing of Speech Signals (Prentice-Hall, 1978), and Multirate Digital 
Signal Processing (Prentice-Hall, 1983). Member, National Academy of En- 
gineering, Eta Kappa Nu, Sigma Xi, Tau Beta Pi. Fellow, Acoustical Society 
of America, IEEE. 




AT&T Technical Journal 
Vol. 64, No. 2, February 1985 
Printed in U.S.A. 


A Conditional Response Time of the M/M/1 
Processor-Sharing Queue

In this paper we examine the distribution of the response time of an arriving
customer conditioned on the number of customers present in an M/M/1 
processor-sharing queue. We show that the rth moment of the distribution is 
a polynomial in the number of customers present, and obtain a recursion for 
its determination. The Laplace-Stieltjes transform of the conditional distri- 
bution is also obtained. We find an expansion in powers of the arrival rate, 
which permits accurate computation when the utilization is not too close to 
one, and also give an asymptotic expansion when the number of customers in 
the system is large. This permits assessment of response time in a heavily 
loaded system. We present numerical results using these methods and discuss 
their relative merits. 


I. INTRODUCTION 


The behavior of many computer systems can be approximated by
the processor-sharing discipline. In this discipline, the server (CPU)
operates at a rate of $\mu$, and whenever $i$ customers are present,
each customer receives service at a rate of $\mu/i$. We assume that the
arrivals occur according to a Poisson process at a rate of $\lambda$ and
that each customer’s service requirement is exponentially distributed.

Of interest to us is the response time conditioned on the number of
customers seen by an arriving customer, including itself. The response
time is the elapsed time between arrival and departure for a customer.
In this paper, we show that the $r$th moment of the conditional response
time is a polynomial of degree $r$ in the number of customers seen by
the arrival, including itself. Further, we show several methods of
obtaining the distribution of the conditional response time in the form
of a Laplace-Stieltjes transform.

The results presented in this paper may be useful in the design of a 
computer or switching system. For example, the designer of such a 
system may ask, “What is the minimum number of customers that 
must be present in the system before the 95th percentile of the 
response time exceeds some prespecified threshold?” The answer to 
this question gives some indication of the performance of the system 
under overload. It may be instrumental in deciding whether any special 
overload control mechanism is needed. If an overload control mecha- 
nism is needed (i.e., the arrivals are blocked if a certain number of 
customers are in the system), the analysis of such a queue is straight- 
forward and will not be discussed in this paper. 

This problem has been studied in greater generality by Coffman et
al.,¹ who obtain the waiting time distribution conditioned on the
number seen by an arrival and the amount of service required by the
arriving customer. However, it is difficult to obtain all the results of
this paper directly from those in Ref. 1. Related work in this area was
done by Sakata et al.,² who characterized a solution for the M/G/1
processor-sharing model. Recently, Ott³ and Ramaswami⁴ have provided
methods for characterizing the unconditional distributions of
the response time for the M/G/1 and GI/M/1 queues, respectively.
The unconditional distribution of the response time for the M/M/1
queue has also been solved by Morrison.⁵ Rege and Sengupta⁶ have
solved a version of the M/M/1 processor-sharing model, which includes
multiprogramming.

In Section II of this paper we derive the conditional moments of the
response time. In Section III, we obtain the Laplace-Stieltjes transform
of the distribution of conditional response time. In Section IV,
we obtain an expansion for the transform in powers of $\lambda$, which is
useful for computation when $\lambda$ is not large. In Section V, the
asymptotic expansion is given for $x \to \infty$. This is especially useful for
computation in a heavily loaded system. In Section VI, we discuss the
numerical issues of this problem.


II. MOMENTS OF THE CONDITIONAL RESPONSE TIME


Let X denote the number of customers seen by an arrival, including
itself, and let T be the response time of this arrival. Let u(x, s) be the
Laplace-Stieltjes transform of the distribution of the conditional
response time, i.e., $u(x, s) = E(e^{-sT} \mid X = x)$ for $x = 1, 2, \ldots$. By
conditioning on the first event (defined as the next arrival or departure,
whichever occurs first), we obtain

Let $\mu_r(x)$ be the rth moment of T conditioned on $X = x$, i.e., $\mu_r(x) =
E(T^r \mid X = x)$. Then by taking the rth derivative of (1) and (2) with
respect to s, multiplying by $(-1)^r$ and setting s to zero, we obtain


$$(x+1)\mu_r(x+2) - (1+b)(x+1)\mu_r(x+1) + bx\,\mu_r(x) = -r(x+1)\mu_{r-1}(x+1)/\lambda \tag{3}$$

for $x = 1, 2, \ldots$,

$$\mu_r(2) - (1+b)\mu_r(1) = -r\mu_{r-1}(1)/\lambda \tag{4}$$

and

$$\mu_0(\cdot) = 1.$$

In eqs. (3) and (4), $b = \mu/\lambda$.

Theorem 1: The solution of (3) and (4) has the form:

$$\mu_r(x) = \sum_{j=0}^{r} a_{rj} x^j \quad \text{for } x = 1, 2, \ldots;\ r = 1, 2, \ldots$$

in which the coefficients $a_{rj}$ satisfy:


r-1 
ay = 12 (- ra,—1,r0r,j/ A- Or,k+1Ck+1,j)— ra;—1,j~-1/ sf Ci 


k= 


for jar-len, le r=1- 


13 (ra,—-14/ + (Grp+i(2"*! — 1 - on} bo 


k=0 


ro 


for $r = 1, 2, \ldots$,
$a_{00} = 1$


and

Proof: We first write $\mu_r(x)$ as a power series of the form

$$\mu_r(x) = \sum_{j=0}^{\infty} a_{rj} x^j$$

and note that

$$\mu_r(x+m) = \sum_{j=0}^{\infty} x^j \sum_{k=j}^{\infty} a_{rk}\, k!\, m^{k-j} / (j!\,(k-j)!).$$

We substitute this in (3) and arrange the left- and right-hand sides as
a power series in x. For $j > r$, we find that the terms containing $x^j$ on
the left-hand side vanish. For any $j > r$, the coefficient of $x^j$ contains
$a_{rk}$ for $k \ge j$, and these coefficients can be chosen arbitrarily. It is,
therefore, not inconsistent to choose $a_{rj} = 0$ whenever $j > r$. The rest
of the theorem follows by equating coefficients of $x^j$ on the left- and
right-hand sides of (3) for $j = r, \ldots, 0$ and verifying that the boundary
condition is satisfied. □


Corollary 1: For $r = 1$ and 2,

$$\mu_1(x) = \frac{x+1}{\lambda(2b-1)}$$

and

$$\mu_2(x) = \frac{2(x+1)\left(x + \dfrac{4b}{2b-1}\right)}{\lambda^2(2b-1)(3b-2)}$$

for $x = 1, 2, \ldots$.


We observe that the mean conditional response time is a linear
function of x. This particular result is true for the M/M/1 First Come
First Served (FCFS) queue as well, for which $\mu_1(x) = x/\mu$. It is readily
seen that the mean conditional response time for the processor-sharing
queue is smaller than that for the FCFS queue if and only if

$$x - 1 > L,$$

where L is the mean number in the system. Thus, the processor-sharing
queue has a smaller mean response time whenever the number
of customers seen by an arrival (excluding itself) is greater than the
mean number in the system.
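
As a numerical check on Corollary 1 and on the FCFS comparison, the following Python sketch evaluates the closed forms $\mu_1(x) = (x+1)/(\lambda(2b-1))$ and $\mu_2(x) = 2(x+1)(x + 4b/(2b-1))/(\lambda^2(2b-1)(3b-2))$ with $b = \mu/\lambda$, and measures how well they satisfy recursion (3). The function names are illustrative and do not come from the paper, and the closed forms are taken here as assumptions to be tested rather than as given.

```python
def mean_ps(x, lam, mu):
    """First conditional moment mu_1(x) under processor sharing."""
    b = mu / lam
    return (x + 1) / (lam * (2 * b - 1))


def second_moment_ps(x, lam, mu):
    """Second conditional moment mu_2(x) under processor sharing."""
    b = mu / lam
    return 2 * (x + 1) * (x + 4 * b / (2 * b - 1)) / (
        lam ** 2 * (2 * b - 1) * (3 * b - 2))


def residual_eq3(r, x, lam, mu):
    """Left side minus right side of recursion (3), for r = 1 or 2."""
    b = mu / lam
    moments = {0: lambda y: 1.0,
               1: lambda y: mean_ps(y, lam, mu),
               2: lambda y: second_moment_ps(y, lam, mu)}
    m, prev = moments[r], moments[r - 1]
    lhs = (x + 1) * m(x + 2) - (1 + b) * (x + 1) * m(x + 1) + b * x * m(x)
    rhs = -r * (x + 1) * prev(x + 1) / lam
    return lhs - rhs


def ps_beats_fcfs(x, lam, mu):
    """True when the PS conditional mean is below the FCFS value x/mu."""
    return mean_ps(x, lam, mu) < x / mu


if __name__ == "__main__":
    # Same parameters as the "Exact (Section II)" columns: x = 3, mu = 1.
    for lam in (0.2, 0.5, 0.9):
        print(lam, round(mean_ps(3, lam, 1.0), 3),
              round(second_moment_ps(3, lam, 1.0), 3))
```

With $\lambda = 0.5$ and $\mu = 1$ the mean number in the system is $L = 1$, so under the stated condition `ps_beats_fcfs` should hold for $x = 3$ and fail for $x = 1$.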

This result parallels the known result about the mean response time 
conditioned on the service requirement (see Ref. 7). There, the pro- 
cessor-sharing queue has a smaller mean response time whenever the 
service requirement is less than the mean service time. 


III. DISTRIBUTION OF THE CONDITIONAL RESPONSE TIME


The difference equation for the Laplace-Stieltjes transform of the
response time distribution, u(s, x), is repeated here for convenience:

$$(x-1)u(x) - a(x-1)u(x-1) + b(x-2)u(x-2) = -b, \quad x \ge 2, \qquad a = b + 1 + \frac{s}{\lambda}. \tag{5}$$

The explicit indication of the dependence of u(s, x) on s has been
suppressed. An exact solution for the Laplace transform suitable for
numerical inversion will be obtained.⁸ Only one boundary condition is
given explicitly in (5), namely, for x = 2, one has

$$u(2) - a\,u(1) = -b; \tag{6}$$

however, another boundary condition is available through the requirement
that u(s, x) be a transform as a function of s. This, of course,
will be satisfied by the solution to be obtained.

Designate by Lu the left-hand side of (5) so that the difference
equation to be solved is

$$Lu = -b; \tag{7}$$

this will be referred to as the complete equation. A function v(x) will
now be found satisfying

$$Lv = 0. \tag{8}$$

Equation (8) is referred to as the homogeneous equation and v(x) as a
complementary solution. Laplace's method will be used to solve (8).⁹
Assume a representation of v(x) of the form

$$v(x) = \int \tau^{x-1} g(\tau)\,d\tau \tag{9}$$

in which the function $g(\tau)$ and a path of integration in the complex
$\tau$-plane are to be chosen. Substitution into (8) and subsequent
integration by parts yields


$$Lv = (\tau^2 - a\tau + b)\,\tau^{x-2} g(\tau)\,\Big| \; - \int \tau^{x-3}\left[(\tau^3 - a\tau^2 + b\tau)\,\dot g(\tau) + \tau^2 g(\tau)\right] d\tau. \tag{10}$$

The vertical bar designates evaluation of the concomitant on the path
to be chosen, and the dot indicates $d/d\tau$. To satisfy (8) the differential
equation





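$$(\tau^2 - a\tau + b)\,\dot g(\tau) + \tau g(\tau) = 0 \tag{11}$$

is to be solved for $g(\tau)$. Let the roots of

$$\tau^2 - a\tau + b = 0 \tag{12}$$

be

$$\gamma = \frac{a - \sqrt{a^2 - 4b}}{2}, \qquad \gamma_1 = \frac{a + \sqrt{a^2 - 4b}}{2}, \tag{13}$$

and let

$$\alpha = \frac{1}{2}\left(1 + \frac{a}{\sqrt{a^2 - 4b}}\right); \tag{14}$$

then the solution of (11) may be written as

$$g(\tau) = (\gamma - \tau)^{\alpha-1}(\gamma_1 - \tau)^{-\alpha}. \tag{15}$$

Since, for real s, $0 < \gamma < \gamma_1$, the path of integration in (9) is chosen
as the segment $(0, \gamma)$ of the real axis; for this choice, the concomitant
term of (10) vanishes and, hence, (8) is satisfied. Thus, one has

$$v(x) = \int_0^{\gamma} \tau^{x-1}(\gamma - \tau)^{\alpha-1}(\gamma_1 - \tau)^{-\alpha}\,d\tau. \tag{16}$$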


Setting $z = \gamma^2/b = \gamma/\gamma_1$ and changing the variable of integration allows
v(x) to be written in the form

$$v(x) = \gamma^{x-1} z^{\alpha} \int_0^1 \tau^{x-1}(1-\tau)^{\alpha-1}(1-\tau z)^{-\alpha}\,d\tau. \tag{17}$$

Comparison of (17) with the integral form of the hypergeometric
function, F(a, b; c; z) shows that (see Ref. 10)

$$v(x) = \gamma^{x-1}\,\frac{\Gamma(\alpha)\Gamma(x)}{\Gamma(\alpha+x)}\, z^{\alpha} F(\alpha, x; \alpha+x; z). \tag{18}$$

This is particularly advantageous since one has

$$F(\alpha, x; \alpha+x; z) = \sum_{j=0}^{\infty} \frac{(\alpha)_j (x)_j}{(\alpha+x)_j}\,\frac{z^j}{j!}, \tag{19}$$

in which

$$(\alpha)_0 = 1, \qquad (\alpha)_j = \alpha(\alpha+1)\cdots(\alpha+j-1), \quad j \ge 1. \tag{20}$$

It may be observed that the series of (19) converges absolutely for
$|z| < 1$, that is, for $\gamma^2 < b$, which holds.

The computation of F in (19) may be simply carried out by the
recursions

$$F_0 = 1, \quad F_j = F_{j-1} + T_j, \qquad T_0 = 1, \quad T_j = T_{j-1}\,\frac{(\alpha+j-1)(x+j-1)}{(\alpha+x+j-1)\,j}\,z, \quad j \ge 1. \tag{21}$$

Similarly, since

$$\frac{\Gamma(\alpha)\Gamma(x)}{\Gamma(\alpha+x)} = \frac{(x-1)!}{(\alpha)_x}, \tag{22}$$

this too may be easily computed recursively. It may be considered,
therefore, that the computations that will be needed in the inversion
procedure to be used may be conveniently performed for the function
v(x).
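
The term recursion for the hypergeometric series and the factorial ratio can be sketched in Python as follows. This is a minimal illustration under the assumption that the recursion reads $T_j = T_{j-1}(\alpha+j-1)(x+j-1)z/((\alpha+x+j-1)j)$; the function names, the stopping tolerance, and the cap on the number of terms are choices made here, not taken from the paper.

```python
import math


def hyp2f1_series(alpha, x, z, tol=1e-15, max_terms=10_000):
    """Sum F(alpha, x; alpha + x; z) term by term.

    T_0 = 1 and T_j = T_{j-1} * (alpha+j-1)(x+j-1) z / ((alpha+x+j-1) j).
    """
    if not abs(z) < 1:
        raise ValueError("the series needs |z| < 1")
    total = term = 1.0
    for j in range(1, max_terms + 1):
        term *= (alpha + j - 1) * (x + j - 1) * z / ((alpha + x + j - 1) * j)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def beta_ratio(alpha, x):
    """Gamma(alpha) Gamma(x) / Gamma(alpha + x) = (x-1)! / (alpha)_x for integer x >= 1.

    Built as a running product so that large x does not overflow a factorial.
    """
    ratio = 1.0 / alpha
    for k in range(1, x):
        ratio *= k / (alpha + k)
    return ratio


if __name__ == "__main__":
    # Known special case: F(1, 1; 2; z) = -ln(1 - z) / z.
    print(hyp2f1_series(1.0, 1, 0.5), -math.log(0.5) / 0.5)
```

The special case $F(1, 1; 2; z) = -\ln(1-z)/z$ gives an independent check of the recursion, since it corresponds to $\alpha = 1$, $x = 1$.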

To solve the complete eq. (7), its order will be depressed. It is
convenient to write it in the form

$$Lu = (x+1)u(x+2) - a(x+1)u(x+1) + bx\,u(x) = -b. \tag{23}$$

Let

$$u(x) = v(x)r(x); \tag{24}$$

then

$$Lu = L(vr) = (x+1)v(x+2)r(x+2) - a(x+1)v(x+1)r(x+1) + bx\,v(x)r(x). \tag{25}$$

Use of the formulae

$$r(x+1) = r(x) + \Delta r(x), \qquad r(x+2) = r(x) + 2\Delta r(x) + \Delta^2 r(x) \tag{26}$$

in (25) yields

$$Lu = (x+1)v(x+2)\Delta^2 r + [2(x+1)v(x+2) - a(x+1)v(x+1)]\Delta r. \tag{27}$$

Setting


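$$w(x) = \Delta r(x) \tag{28}$$

in (27) now provides the first-order equation

$$w(x+1) - R(x)w(x) = -\frac{b}{(x+1)\,v(x+2)}, \qquad R(x) = a\,\frac{v(x+1)}{v(x+2)} - 1. \tag{29}$$

Direct substitution verifies that

$$w(x) = \sum_{j=1}^{\infty} \frac{b}{(x+j)\,v(x+j+1)\,R(x)R(x+1)\cdots R(x+j-1)} \tag{30}$$

is a solution of (29).

To show that w(x) converges, one has

$$(1-\tau z)^{-\alpha} \sim (1-z)^{-\alpha}, \quad \tau \to 1-; \tag{31}$$

hence, from (17),

$$v(x) \sim \gamma^{x-1}\left(\frac{z}{1-z}\right)^{\alpha} \int_0^1 \tau^{x-1}(1-\tau)^{\alpha-1}\,d\tau, \tag{32}$$

$$v(x) \sim \gamma^{x-1}\left(\frac{z}{1-z}\right)^{\alpha} \frac{\Gamma(\alpha)\Gamma(x)}{\Gamma(\alpha+x)}. \tag{33}$$

Further, since

$$\frac{\Gamma(x)}{\Gamma(\alpha+x)} \sim x^{-\alpha}, \quad x \to \infty, \tag{34}$$

one has

$$v(x) \sim \gamma^{x-1}\left(\frac{z}{1-z}\right)^{\alpha} \Gamma(\alpha)\,x^{-\alpha}, \quad x \to \infty; \tag{35}$$

thus

$$R(x) \sim \frac{a}{\gamma} - 1, \quad x \to \infty. \tag{36}$$

Evaluation of the jth term comprising w(x) now shows that it is of
the order

$$j^{\alpha-1}\left(\frac{\gamma}{b}\right)^{j}. \tag{37}$$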


Since $\gamma\gamma_1 = b$ and $\gamma_1 > \gamma$, one has $\gamma < \sqrt{b}$, thus $\gamma < b$ if $b > 1$, that is,
if the offered load $b^{-1} = \lambda/\mu < 1$. In this case the series for w(x)
converges.

A one-parameter family of solutions of (7) obtained from (28) and
(24) is

$$u(x) = v(x)\left[D + \sum_{j=1}^{x-1} w(j)\right], \tag{38}$$

in which D is a constant that is determined by use of the boundary
condition (6). The final result is

$$u(x) = v(x)\left[\sum_{j=1}^{x-1} w(j) - \frac{b + w(1)v(2)}{v(2) - a\,v(1)}\right]. \tag{39}$$
In order that u(x)/s be a Laplace transform, one must have

$$\lim_{s \to \infty} \frac{u(x)}{s} = 0, \tag{40}$$


which is verified by (39). Accordingly, (40) is the second boundary 
condition required to specify a unique solution of (7). That this is true 
follows from an examination of the complete solution of (7), which 
will not be done here. 
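
The construction of this section can be sketched end to end in Python: build v(x) from the hypergeometric form, sum the series for w(x), and assemble u(x) from the final result (39). This is an illustrative sketch only. It assumes the forms $a = b + 1 + s/\lambda$, $R(x) = a\,v(x+1)/v(x+2) - 1$, and $w(x) = \sum_{j \ge 1} b/[(x+j)\,v(x+j+1)\,R(x)\cdots R(x+j-1)]$; the function names and the fixed truncation length `terms` are hypothetical choices, and no attempt is made at the numerical inversion itself.

```python
import math


def roots_and_alpha(s, lam, mu):
    """Return a, b, gamma, gamma_1, alpha and z = gamma / gamma_1."""
    b = mu / lam
    a = b + 1 + s / lam
    disc = math.sqrt(a * a - 4 * b)
    gamma, gamma1 = (a - disc) / 2, (a + disc) / 2
    alpha = 0.5 * (1 + a / disc)
    return a, b, gamma, gamma1, alpha, gamma / gamma1


def make_v(s, lam, mu):
    """Build the complementary solution v(x) for integer x >= 1."""
    a, b, gamma, _, alpha, z = roots_and_alpha(s, lam, mu)
    cache = {}

    def v(x):
        if x not in cache:
            beta = 1.0 / alpha                  # (x-1)! / (alpha)_x as a product
            for k in range(1, x):
                beta *= k / (alpha + k)
            total = term = 1.0                  # hypergeometric series
            j = 0
            while abs(term) > 1e-17 * abs(total):
                j += 1
                term *= ((alpha + j - 1) * (x + j - 1) * z
                         / ((alpha + x + j - 1) * j))
                total += term
            cache[x] = gamma ** (x - 1) * beta * z ** alpha * total
        return cache[x]

    return v, a, b


def transform_u(x, s, lam, mu, terms=100):
    """Evaluate u(x) at real s > 0, with the w-series truncated after `terms` terms."""
    v, a, b = make_v(s, lam, mu)

    def big_r(y):
        return a * v(y + 1) / v(y + 2) - 1

    def w(y):
        total, prod = 0.0, 1.0
        for j in range(1, terms + 1):
            prod *= big_r(y + j - 1)
            total += b / ((y + j) * v(y + j + 1) * prod)
        return total

    partial = sum(w(j) for j in range(1, x))
    return v(x) * (partial - (b + w(1) * v(2)) / (v(2) - a * v(1)))


if __name__ == "__main__":
    print(transform_u(1, 1.0, 0.5, 1.0))
```

A useful sanity check is that the computed values satisfy the boundary condition $u(2) - a\,u(1) = -b$ and the difference equation in its shifted form, $(x+1)u(x+2) - a(x+1)u(x+1) + bx\,u(x) = -b$, to rounding error.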


IV. PERTURBATION IN $\lambda$

For this purpose (5) is written in the form

$$\lambda(x+1)u(x+2) - (\mu + s + \lambda)(x+1)u(x+1) + \mu x\,u(x) = -\mu, \tag{41}$$

and the boundary condition in (6) takes the form

$$\lambda u(2) - (\mu + s + \lambda)u(1) = -\mu. \tag{42}$$


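Fortunately this constitutes a singular perturbation. Writing u(x) in
the form

$$u(x) = \sum_{j=0}^{\infty} u_j(x)\lambda^j \tag{43}$$

and substituting into (41), (42) yields the equations

$$(x+1)u_0(x+1) - \left(1 + \frac{s}{\mu}\right)^{-1} x\,u_0(x) = \left(1 + \frac{s}{\mu}\right)^{-1}, \qquad u_0(1) = \left(1 + \frac{s}{\mu}\right)^{-1} \tag{44}$$

and

$$(x+1)u_j(x+1) - \left(1 + \frac{s}{\mu}\right)^{-1} x\,u_j(x) = \frac{1}{\mu}\left(1 + \frac{s}{\mu}\right)^{-1}(x+1)\Delta u_{j-1}(x+1), \qquad u_j(1) = \frac{1}{\mu}\left(1 + \frac{s}{\mu}\right)^{-1}\Delta u_{j-1}(1). \tag{45}$$

These are first-order equations for $u_0(x)$ and $u_j(x)$, which present no
difficulty of solution. One obtains

$$u_0(x) = \frac{\mu}{sx}\left[1 - \left(1 + \frac{s}{\mu}\right)^{-x}\right], \qquad u_j(x) = \frac{1}{\mu x}\sum_{z=1}^{x}\left(1 + \frac{s}{\mu}\right)^{z-x-1} z\,\Delta u_{j-1}(z), \quad j \ge 1. \tag{46}$$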
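
A truncated version of the perturbation series can be evaluated at a real $s > 0$ with the short Python sketch below. It assumes the forms $u_0(x) = (\mu/(sx))[1 - (1+s/\mu)^{-x}]$ and $u_j(x) = (\mu x)^{-1}\sum_{z=1}^{x}(1+s/\mu)^{z-x-1} z\,\Delta u_{j-1}(z)$; the function name and the truncation parameter `order` are hypothetical, and the sketch is meant for small $\lambda$ only.

```python
def perturbation_transform(x, s, lam, mu, order=40):
    """Approximate u(x) by the series in powers of lam, truncated at `order`.

    u_j(x) needs Delta u_{j-1}(z) for z <= x, hence u_{j-1} up to x + 1, so
    level j is tabulated for arguments 1 .. x + order - j.  Requires s > 0.
    """
    c = 1.0 / (1.0 + s / mu)                    # (1 + s/mu)^(-1)
    top = x + order
    # Level 0 on 1..top; index 0 is unused padding.
    level = [0.0] + [mu / (s * y) * (1.0 - c ** y) for y in range(1, top + 1)]
    total = level[x]
    power = 1.0
    for _ in range(order):
        top -= 1
        nxt = [0.0] * (top + 1)
        acc = 0.0
        for y in range(1, top + 1):
            delta = level[y + 1] - level[y]     # Delta u_{j-1}(y)
            # acc = sum over z <= y of c**(y - z + 1) * z * Delta u_{j-1}(z)
            acc = c * (acc + y * delta)
            nxt[y] = acc / (mu * y)
        level = nxt
        power *= lam
        total += power * level[x]
    return total


if __name__ == "__main__":
    s = 1e-3
    # (1 - u)/s approaches the conditional mean as s -> 0.
    print((1.0 - perturbation_transform(3, s, 0.2, 1.0)) / s)
```

For small $s$ the quantity $(1 - u(x))/s$ approximates the conditional mean, which offers a cross-check against the exact first moment of Section II.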


V. ASYMPTOTICS FOR $x \to \infty$


The derivation of the asymptotic expansion of u(x) depends on the
operational method of Boole,¹¹ Jagerman,¹² and Milne-Thomson (see
Ref. 9); it is somewhat involved, so the details will be omitted, but the
results may be easily stated. One may write


oo 


st 


OP 2 ene et PD a 
in which the coefficients are given by 
db 7b 
a) =~; a2 - —- "9 > 
s s 
\°b(b — 2) \*b(b — 2)(2b — 3) =—.2d°b? 
ee a TS as ae ema CTL (48) 


and, in general, by the recursion 
ons r,. 5 : 
a=, (7 — 2)(2 — a) + Iaj-1 + = — 2)’aj-2, J22. (49) 


Let $F(\tau, x)$ be the complementary distribution corresponding to
u(x); then


VI. NUMERICAL RESULTS


Our paper was concerned with exact answers for the moments 
(Section II), answers in the form of Laplace-Stieltjes transforms 
(Sections III and IV) for the distribution function, and asymptotic 
answers (Section V) for large x. One issue that we were concerned 
with was the applicability of these methods for ranges of parameter 
values of the problem. 

In Table I, we present the first and second moments of the sojourn
time calculated by three different methods (Sections II, III, and IV).
The value of $\mu$ was chosen to be 1, x was taken to be 3, and $\lambda$ was
varied from 0.2 to 0.9. We numerically inverted the transforms obtained
in Sections III and IV by the method of Ref. 8 and computed
the moments from the distribution. As can be seen, the results from
the two distributions are accurate for the first moment
for moderate and low values of $\lambda$. The method of Section III seems to
be slightly better than the method of perturbations for a large value
of $\lambda$. For the second moment, the results seem to be slightly less
accurate. We note that we calculate the moments by numerically
integrating the complementary distribution. Thus, truncation will
cause the calculated moments to be underestimated. This fact is borne
out in all the examples. In Fig. 1, we show the complementary
distribution of the sojourn time obtained by inverting the transforms from eq. (39).

In Table II, we show the accuracy of the asymptotic results presented 
in Section V. Here we calculate the complementary distribution when
$\lambda = 0.5$, $\mu = 1$, and $x = 10$. As can be seen, there is good correspondence
between the perturbation method (which uses the numerical inversion 
technique of Ref. 8) and the asymptotic solution. 


Table I—Numerical results for the first and second moments 


($x = 3$, $\mu = 1$)

             Mean = $\mu_1(x)$                         Second moment = $\mu_2(x)$
$\lambda$    Exact (Section II)  Eq. (39)  Eq. (46)    Exact (Section II)  Eq. (39)  Eq. (46)
0.2          2.222               2.216     2.216       8.927               8.659     8.659
0.5          2.667               2.636     2.636       15.111              13.758    13.766
0.9          3.636               3.619     3.384       40.220              35.664    25.243


[Figure: complementary conditional distribution $F(\tau, x)$ for $x = 3$, $\mu = 1$; horizontal axis marked from 0 to 24.]
Fig. 1—Complementary conditional distribution of the sojourn time. 
Table II—Numerical results for the asymptotic 


solution ($\lambda = 0.5$, $\mu = 1$, $x = 10$)
$F(\tau, x) = P(T > \tau \mid X = x)$

$\tau$    From asymptotic solution, eq. (50)    From perturbation, eq. (46)
1.0       0.9023                                0.9023
2.0       0.8092                                0.8092
3.0       0.7207                                0.7209
4.0       0.6370                                0.6376
5.0       0.5580                                0.5596
6.0       0.4839                                0.4874
7.0       0.4147                                0.4215
8.0       0.3504                                0.3621
9.0       0.2912                                0.3094
10.0      0.2370                                0.2630


In conclusion, we would recommend the method of Section III for 
small values of x; the asymptotic solution of Section V for large values 
of x; and the perturbation method of Section IV, where x takes a wide 
range of values and when $\lambda$ is not large.


A Study on the Ability to Automatically
Recognize Telephone-Quality Speech From 
Large Customer Populations 


By J. G. WILPON* 
(Manuscript received August 7, 1984) 


To ascertain whether a speaker-independent word recognition system, using 
current technology, could function in normal telephone environments, it was 
necessary to conduct a study under such real-world conditions. Such an 
experiment was described by Wilpon and Rabiner (1983), in which telephone 
customers, speaking under ordinary telephone conditions, in Portland, Maine, 
were asked to speak their telephone number as a sequence of isolated digits. 
For each customer a maximum of four digits were obtained. The results from 
that study were very encouraging and led to further improvements in our 
recognition systems. To further study the feasibility of implementing speech 
recognition systems for general use over the telephone network, another field 
study was initiated. In this test, spoken seven-digit telephone numbers were 
obtained from a large number of telephone customers over a variety of 
transmission facilities in Baton Rouge, Louisiana. This paper presents the 
results of several recognition experiments performed on this database. Exper- 
iments were also carried out quantifying the robustness of template sets 
created in Portland, Baton Rouge, and under laboratory conditions in our 
Murray Hill, New Jersey, laboratory. Finally, a recognition system that 
incorporates syntactic information available in a seven-digit telephone number is
discussed. Our tests indicate a number of distinct real-world problems that 
must be considered when implementing a speech recognition system for 
widespread use. A discussion of the overall results and the implications for 
future research will be given.


I. INTRODUCTION


The development of a speaker-independent speech recognition sys- 
tem that performs well over dialed-up telephone lines has been a goal 
of AT&T Bell Laboratories for close to a decade.'* However, until 
recently all evaluations of our recognition systems have been based on 
laboratory recording conditions. These conditions typically consisted 
of cooperative subjects using local dialed-up lines over a Private 
Branch Exchange (PBX). Peak signal-to-average noise ratios under 
these conditions generally ranged from 40 to 60 dB. Using such local 
switched lines, the performance of the speech recognition algorithms 
tested was found to be quite good for a wide range of vocabulary sizes 
and complexities and for a wide range of talkers. 

An earlier effort was made to test the viability of our speaker- 
independent, isolated word recognition systems on a very large tele- 
phone customer population.® The task was conducted under “real 
world” conditions, i.e., asking telephone customers to speak their 
telephone numbers in a home environment over randomly dialed-up 
lines in Portland, Maine (PO). Under these conditions, signal-to-noise 
ratios (s/n) of between 8 dB and 60 dB were encountered. The results 
of this study yielded a recognition accuracy of 93.1 percent. While this 
was not as high an accuracy as was achieved in the laboratory,” given 
the transmission medium and the problems associated with obtaining 
isolated digit strings over standard telephone lines, these results were 
extremely encouraging. 

There were several other shortcomings associated with our previous 
study. First, the wide variety of transmission and switching conditions 
made it very difficult to detect the spoken words automatically. Sec- 
ond, for privacy, our database consisted of at most the last four digits 
of the customer’s telephone number. Because of this, parts of the first 
digit recorded were sometimes deleted. (The digitization of the input 
speech had to be initiated by a site observer after the first three digits 
were spoken. In some cases the observer was not quick enough to start 
the recording procedure before the fourth digit was spoken.) Third, 
about 50 percent of the speech data available from recording was 
thrown away, either because it contained some connected digit strings 
or the background conditions were too severe. 

Another problem that existed in our earlier study was getting casual 
telephone customers to speak their phone number as a sequence of 
isolated digits. This was related to human factors issues, that is, people 
do not normally speak in an isolated word format. 

As a result of the problems that were encountered during our initial 
exercise in the “real world,” we found that we were testing our 
recognition systems on only a small percentage of all the speech data 
to which we had access. The purpose of this paper is to describe a new
data collection exercise that was carried out to more accurately deter- 
mine our speech recognition system’s capabilities over randomly 
dialed-up telephone lines. Over a two-week period we recorded all 
customer information from approximately 7400 callers. No calls were 
eliminated, and all seven digits were recorded. 

The database was collected over randomly dialed-up telephone lines 
at an AT&T centralized switching office in Baton Rouge, Louisiana 
(BR). The customers that participated would normally speak their 
telephone number to an operator. That is, the subjects were performing 
a task that they had done before, except that now input was to be 
given in an isolated fashion. Special-purpose hardware® was attached 
to one operator console, which automatically answered a call and 
asked the customer to speak his phone number as a series of isolated 
digits. The hardware also cataloged the caller’s transaction, and digi- 
tized and stored the customer’s speech on magnetic tape. 

There are several very important issues that need to be studied 
before speech recognition can be made available to large telephone 
user populations. The most important issue is end-to-end system 
recognition accuracy. That is, if over time N calls are received by the 
system and must be handled, what is the percentage of the N calls 
that will be able to go through the system automatically without any 
failures? Such failures include the caller hanging up, word endpoint 
problems, isolated input problems, and the possibility that a human 
operator would have to intervene during the course of the transaction 
(e.g., if the customer misunderstood the instructions). These issues 
are examined in detail within the text of this paper. 

Another issue that will be discussed is the robustness of speaker- 
independent templates created in one recording environment, using 
one set of talkers, and tested under different conditions with new sets 
of talkers. In past recognition studies, training data and testing data 
were collected under laboratory conditions in our Murray Hill, N.J., 
(MH) laboratory.’ The subjects for these studies were all native 
speakers of American English mostly from the New York metropolitan 
area. In our Portland study, the speech data obtained were tested 
against a speaker-independent template set created from laboratory 
speech data. The results indicated that the MH template set was 
inadequate for recognizing speech from Portland customers. With the 
addition of the Baton Rouge database, more experiments were carried 
out using speech data from BR, PO and MH. All possible combinations 
of template sets and testing conditions were tried and the results show 
the Baton Rouge template set to be quite robust for a wide range of 
talkers and over a wide range of transmission mediums. 

Although past research has shown that isolated word recognition 
systems perform adequately, the power of speech recognition lies in
its ability to perform a given task reliably, i.e., the word recognizer
should be embedded within a larger system. The task can usually be 
specified as a set of simple rules that define the task syntax. The 
syntax is able to limit the possible recognition sequences at each point 
in the transaction. Several task-oriented systems have been described 
in early work, for example, a voice-controlled repertory dialer system” 
and a directory listing retrieval system.'’ For each of these systems 
the addition of syntactic constraints greatly increased recognition 
performance. 

Since past studies have shown the additional syntactic information 
to be useful, a system was constructed that incorporated knowledge 
about our task, i.e., the speaking of a seven-digit telephone number as 
a series of isolated digits. A full description of the system syntax and 
results will be presented. 

In Section II, we briefly review the results obtained from the 
Portland study. Section III gives a description of the recording pro- 
cedure used to obtain data in Baton Rouge. Section IV discusses the 
composition of the BR database. In Section V, we review some recent 
advances in speech recognition, which apply to our study. In Section 
VI we present the results from a series of recognition experiments 
performed on the BR database. The issue of template robustness is 
discussed in Section VII. A discussion of the overall results and their 
implications is given in Section VIII. 


II. REVIEW OF PORTLAND DATA COLLECTION EXPERIMENT


Recordings were made at an AT&T switching office in Portland, 
Maine.® A prerecorded spoken message (a prompt) was given to each 
customer requesting that he speak his telephone number as a sequence 
of isolated digits. For reasons of customer privacy we recorded only 
the last four digits of the telephone number. As each of the digits was 
spoken, a site observer entered the digits on a keyboard. The observer 
determined whether the digit sequence was spoken in an isolated 
format (i.e., spoken with sufficient pauses between words). If not, the 
observer initiated another prerecorded spoken message (a reprompt) 
requesting the user to repeat his number with a longer pause between 
digits. If the observer decided that the final speech was unacceptable 
(either because it was spoken in a connected manner or because of 
unacceptably poor telephone line conditions), a reject code was entered 
and the entire procedure was terminated for the current call. 

The recordings were bandpass filtered from 100 Hz to 3200 Hz, 
sampled at a 6.67-kHz rate, and then digitally transmitted to our 
laboratory in Murray Hill, N.J., for analysis. The log energy of the 
waveform was displayed to another observer, along with the automat- 
ically determined sets of endpoints indicating where in the recording
interval the isolated words could be found. At this point the second
observer had the option of modifying any or all sets of endpoints 
computed or eliminating any digit from the string. The segmented 
speech was then entered into a database for later examination. Using 
this procedure 11,035 digits from 3100 customers were recorded over 
a 23-day period. 
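
To illustrate what an automatic endpoint proposal based on log energy might look like, here is a deliberately simple Python sketch. It is not the endpoint detector used in this study and not the algorithm of any cited reference; it is a plain peak-relative energy threshold, and the frame length, the 40-dB margin, and the minimum run length are all hypothetical values chosen for the example.

```python
import math


def frame_log_energy(samples, frame_len):
    """Log energy in dB of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(v * v for v in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-10))  # floor avoids log(0)
    return energies


def find_endpoints(energies, below_peak_db=40.0, min_frames=2):
    """Return (first_frame, last_frame) pairs for runs within below_peak_db of the peak."""
    threshold = max(energies) - below_peak_db
    segments, start = [], None
    # A trailing sentinel closes a run that lasts to the final frame.
    for i, e in enumerate(energies + [float("-inf")]):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start, i - 1))
            start = None
    return segments


if __name__ == "__main__":
    tone = [1000.0 * math.sin(2 * math.pi * n / 10) for n in range(300)]
    signal = [0.0] * 200 + tone + [0.0] * 200 + tone[:200] + [0.0] * 100
    print(find_endpoints(frame_log_energy(signal, 100)))
```

A fixed threshold of this kind is exactly what tends to fail on clicks, tones, and background speech, which is why a human observer was allowed to correct the proposed endpoints.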

Using a 3900 token subset of the PO speech data to train the 
recognizer, a 30-template-per-word reference set was created. (Several 
different template sets were tested in the PO study. The results 
presented here are for that template set that yielded the highest 
recognition accuracy.) When this set was tested against the full 11,035- 
digit database, a recognition accuracy of 93.1 percent was obtained. 

There were several problems that occurred during the recording 
phase. These were classified as being in one of two groups. The first 
group consisted of problems associated with the telephone transmis- 
sion conditions, e.g., loud static noises—probably caused by atmos- 
pheric disturbances, pops and/or clicks (switching transients), loud 
tones (mostly carrier frequency tones at 2600 Hz), and loud broadband 
“humming” noises (probably caused by a missing ground connection 
somewhere in the transmission path). Resulting peak signal-to-noise 
ratios varied from as little as 8 dB to as much as 60 dB. The second 
group consisted of problems related to the talker and the environment 
in which he or she spoke. These included nonisolation of speech (i.e., 
the digits were connected) and the presence of extraneous background 
speech. Most of these failures were severe enough to warrant elimi- 
nation of the customer’s speech from the database. This occurred for 
47 percent of all calls available for processing. 

As a consequence of the Portland study, several areas for improving 
recognition performance were discovered. Subsequently, additional 
research in speech endpoint detection algorithms” and clustering 
algorithms” (i.e., template generation procedures) was carried out. 
Results from this research have been applied throughout our BR study 
(see Section V). 


III. RECORDING PROCEDURE USED IN BATON ROUGE


Figure 1 shows a block diagram of the overall recording setup used 
in this study. All recordings were made at an AT&T switching office 
in Baton Rouge, La. To record customer data in an efficient manner 
special-purpose hardware and control software were required. The 
hardware included an MC68000 controller, a terminal, a 7-1/2 inch 
magnetic tape unit, A/D and D/A converters, a cartridge tape unit, 
and signal conditioning circuitry. This hardware was attached to a 
dedicated operator console, a full description of which is given in Pirz
and Bauer.?

Fig. 1—Block diagram of overall recording system used in the Baton Rouge study.

The sequence of events to record a single customer's
speech input was as follows:

1. An incoming customer call was automatically answered by the 
Special-Purpose Hardware (SPH). A prerecorded prompt was then 
played to the customer requesting that he speak his seven-digit tele- 
phone number as a sequence of isolated digits. As the digits were 
spoken, a Site Observer (SO) keyed in the identity of the spoken digit 
string. Also, as the customer spoke, the SPH digitized the speech at a 
6.67-kHz rate with appropriate filtering applied. 

2. Once the customer finished speaking, the SO made a judgment 
as to whether the speech was spoken in an isolated format (i.e., with 
sufficient pauses between words). If not, the SO would initiate a 
reprompt requesting the customer to repeat his number with longer 
pauses between words.

3. After the customer completed his task, he or she was given a
prerecorded “Thank you” message. The SO then entered an ASCII 
character string indicating any comments about the talker, such as 
sex, ability to follow instructions, etc. 

4. After the above steps were completed, the digitized speech was 
written out to the magnetic tape unit. Appropriate header information 
containing the identity of the input speech string, date and time of 
utterance, and any comments entered by the SO was also recorded on 
tape. If a reprompt had to be made, both the original and reprompted 
speech were saved. 

It should be noted that a significant number of customers abandoned 
their call without speaking their phone number. All abandoned calls 
were cataloged and will be discussed later. 

Using this procedure we recorded data from 7373 subjects (on 33 
magnetic tapes) over a two-week period. After the data collection was 
completed, the speech was read from the magnetic tapes into a Data 
General MV8000 minicomputer where all further analysis was per- 
formed. 


IV. COMPOSITION OF FINAL DATABASE 


Recordings were made for an average of six hours a day, five days a 
week, for two weeks. Tables I and II show a detailed analysis of the
final telephone customer database.


Table I—Statistics on total number of calls handled in BR study

                                     Number of callers   Percent
(1) Total callers                    7373                100
(2) Abandoned calls                  1468                20
(3) Net total calls (1-2)            5905                80
(4) Operator intervention            2301                31
(5) Unidentifiable calls             518                 7
(6) > 7 digits spoken                269                 4
(7) Processable calls (3-4-5-6)      2817                38


Table II—Sex makeup of processed calls from BR data

                     Number of calls   Percent
Net total calls      5905              100
Adult male           2137              36
Adult female         3524              60
Children             168               3
Unclassifiable       76                1


Data were collected from a total of
7373 callers. Of this total, 1468, or 20 percent were callers that
abandoned their calls, and therefore did not enter any speech data. In 
these cases the caller hung up before beginning the recording task. 
This leaves a net total of 5905 calls that yielded some speech output. 
Of the remaining callers, 2137 (36 percent) were adult males, 3524 (60 
percent) were adult females, 168 (3 percent) were children, and 76 
(1 percent) were unclassifiable. Of the 5905 useful calls, 2301 (31
percent of all 7373 callers, or 39 percent of the 5905) required the
telephone operator to cut in during the middle of
the recording transaction. In these cases, the user got confused about 
the task he was to perform or simply did not want to cooperate. Since 
the caller had to supply his telephone number to complete his tele- 
phone call, the operator had to intervene. Generally, the user gave his 
phone number in a continuous fashion to the operator. Therefore, no 
useful isolated data could be extracted from these callers. There were 
several calls (518, or 7 percent) where the SO and later another 
observer could not understand one or more of the words that were 
spoken and therefore could not tag them correctly. These were caused 
either by very bad transmission noises or a very pronounced accent. 
For another 269 calls (4 percent), the customer spoke more than seven 
digits. 

If we take into account all the calls that had some problems asso- 
ciated with them we would be left with a total of 2817 calls (38 percent) 
that were “processable,” i.e., these calls contained only spoken digits. 
Therefore, an automatic procedure could be devised to first find the 
spoken words (endpointing) and second perform recognition on those 
words. All further discussion of the BR database will refer to this data 
set. 
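The call accounting above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the counts are the ones quoted in the text, while the variable and function names are hypothetical and introduced here for the example.

```python
# Call counts quoted in the text for the Baton Rouge (BR) collection.
TOTAL_CALLS = 7373
ABANDONED = 1468
OPERATOR_CUT_IN = 2301
NOT_UNDERSTOOD = 518
MORE_THAN_SEVEN_DIGITS = 269


def percent_of_total(count: int, total: int = TOTAL_CALLS) -> int:
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# Calls that yielded some speech output.
net_calls = TOTAL_CALLS - ABANDONED

# Calls left after removing every category that had some problem.
processable = (
    net_calls - OPERATOR_CUT_IN - NOT_UNDERSTOOD - MORE_THAN_SEVEN_DIGITS
)

for label, count in [
    ("abandoned", ABANDONED),
    ("operator cut in", OPERATOR_CUT_IN),
    ("not understood", NOT_UNDERSTOOD),
    ("more than seven digits", MORE_THAN_SEVEN_DIGITS),
    ("processable", processable),
]:
    print(f"{label:24s} {count:5d} {percent_of_total(count):3d} percent")
```

All percentages in this breakdown are taken relative to the 7373 calls received, which is why the 2817 processable calls correspond to 38 percent.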

One problem that existed in our earlier recognition experiment of 
telephone-quality speech was the inability of the prompts to get the 
callers to speak in an isolated format.® This problem still existed in 
the BR study. Therefore, we decided to segment the database into two 
sets—one where all the digits were spoken in isolation, and another 
that contained those calls with any connected digits. Of the 2817 
processable calls, 1837 contained only isolated digits, and 980 con- 
tained some connected digits. 

The recording hardware had memory for at most a 15-second utter- 
ance. Some callers paused so long in between digits that they simply 
ran out of time. Therefore, we further classified the calls on the basis 
of whether all seven digits were present. The reason for this classifi- 
cation was that if it were known a priori that exactly seven digits were 
present, we could devise a procedure that recognizes them more 
accurately. Of the 1837 calls containing only isolated digits, 1634 (89 
percent) consisted of all seven digits. Of the 980 calls containing some 
connected speech, only three calls had fewer than seven digits.


V. REVIEW OF RECOGNITION SYSTEM IMPROVEMENTS


5.1 Endpoint detector improvements 


Before evaluating this telephone-quality speech database, several 
issues had to be addressed. One issue was the detection of speech 
within some time interval. In our previous study,® we used an endpoint- 
ing algorithm developed by Lamel et al.!* This technique had proved 
quite robust in detecting speech over local dialed-up telephone lines. 
However, endpoint detection becomes a much more difficult problem 
when the transmission system is corrupted by the many noises found 
on standard dialed-up telephone lines (such as those received at TSPS 
offices). Such factors as popping sounds, crackling noises, background 
speech, carrier frequency tones, and other nonstationary noises make 
it very hard to detect word boundaries accurately. 

In Wilpon et al. it was shown that only 69 percent of all words were 
detected by the Lamel approach when tested on a large random subset 
of the PO database.’ Among these, the recognizer accurately classified 
85 percent. This yielded an overall recognition system accuracy of only 
59 percent. Because of these results, it was decided to try to improve
the endpoint algorithm before proceeding further. This led to the 
development of a new word detection algorithm, called a top-down 
design.’* The new approach makes the assumption that if speech is 
present in some time interval its energy level will be above that of any 
noise also present. Simply put, the new algorithm searches for strong 
(vowel-like) peaks in the energy contour of a speech utterance and 
processes the speech around the peaks to find potential beginning and 
ending points. Several rules involving duration, onset, and decay times 
are then used to refine the endpoint estimates. 
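The general idea of searching for strong energy peaks and then growing a word candidate around each peak can be sketched as follows. This is not the published top-down algorithm: the noise-floor estimate, the two decibel margins, the single minimum-duration rule, and all names are assumptions made for illustration, and the onset and decay rules mentioned above are omitted.

```python
from typing import List, Tuple


def find_word_candidates(
    energy_db: List[float],
    peak_margin_db: float = 20.0,   # hypothetical: how far a peak must exceed the noise floor
    edge_margin_db: float = 6.0,    # hypothetical: level at which a word is taken to begin or end
    min_frames: int = 5,            # hypothetical duration rule to discard pops and clicks
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of candidate words, end inclusive."""
    if not energy_db:
        return []

    # Crude noise-floor estimate: a low percentile of the frame energies.
    noise_floor = sorted(energy_db)[len(energy_db) // 10]
    peak_level = noise_floor + peak_margin_db
    edge_level = noise_floor + edge_margin_db

    candidates: List[Tuple[int, int]] = []
    n = len(energy_db)
    i = 0
    while i < n:
        if energy_db[i] >= peak_level:
            # Grow the region outward from the strong frame until the
            # energy drops below the edge level on each side.
            start = i
            while start > 0 and energy_db[start - 1] >= edge_level:
                start -= 1
            end = i
            while end + 1 < n and energy_db[end + 1] >= edge_level:
                end += 1
            # Duration rule: very short bursts are treated as noise.
            if end - start + 1 >= min_frames:
                candidates.append((start, end))
            i = end + 1
        else:
            i += 1
    return candidates


# Synthetic contour: silence, one vowel-like word, silence, a one-frame pop.
contour = [30.0] * 10 + [40.0, 55.0, 60.0, 58.0, 45.0, 38.0] + [30.0] * 10 + [52.0] + [30.0] * 10
print(find_word_candidates(contour))
```

In this toy contour the six-frame region around the strong peak is kept, while the isolated one-frame burst is rejected by the duration rule.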

Applying this new endpoint algorithm to the same data set as was 
tested with the Lamel algorithm (i.e., a subset of the PO database) 
yielded a word detection rate of 98 percent and a recognition accuracy 
of 90 percent for an overall system accuracy of 89 percent. Clearly, 
from the results obtained, the top-down endpointing algorithm is 
superior to the Lamel approach. As a result of this research, the top- 
down algorithm was used in all studies of the BR database. Whereas 
in the PO database study over 50 percent of the database had to have 
manual corrections made to the endpoints (because of endpoint algo- 
rithm failures) no manual correction of endpoints was performed in 
the BR study. 
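The overall system accuracies quoted above are consistent with multiplying the word detection rate by the recognition accuracy on the detected words. The short sketch below makes that arithmetic explicit; the function name is hypothetical, and the inputs are the rounded percentages given in the text.

```python
def overall_accuracy(detection_rate: float, recognition_accuracy: float) -> float:
    """Fraction of all words that are both detected and correctly recognized."""
    return detection_rate * recognition_accuracy


lamel = overall_accuracy(0.69, 0.85)      # Lamel endpoint detector
top_down = overall_accuracy(0.98, 0.90)   # top-down endpoint detector

print(f"Lamel:    {100 * lamel:.1f} percent")
print(f"Top-down: {100 * top_down:.1f} percent")
```

With these rounded inputs the product is about 59 percent for the Lamel front end and about 88 percent for the top-down one. The latter is within a point of the reported 89 percent, a gap that rounding of the published rates could plausibly explain.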


5.2 Clustering analysis improvements 


In Wilpon and Rabiner’? a new clustering algorithm was presented 
that uses the best features of several previously used algorithms—i.e., 
ISODATA?"®, K-means”, and UWA®. This algorithm is called the
Modified K-Means (MKM) clustering algorithm. Its main advantage
over other algorithms is that it is completely automatic, and requires 
no user input (other than a similarity matrix). This algorithm was 
tested extensively on the BR database and was shown to yield recog- 
nition results as good as previously used algorithms. In the experi- 
ments to follow, all template sets created from subsets of the BR 
speech data will have been created using the MKM algorithm. 


VI. RECOGNITION RESULTS ON THE BR DATABASE


6.1 Isolated word recognition results


For all isolated word recognition experiments performed on the BR 
database, the isolated database was divided into two disjoint subsets, 
one to train the recognizer and the other to test the system. A total of 
4783 tokens were used for training and another 7973 tokens for testing. 
Table III shows the distribution of the training and testing tokens 
among the ten digits. 

In the PO study, template sets were created both from a random 
subset of the speech data and from the “cleanest” speech data (i.e., as 
judged by a human to be close to laboratory-quality data). Similar 
recognition results were obtained using both template sets. Therefore, 
in the BR study template sets were created using only a random subset 
of speech data. 

Table III—Number of tokens for each digit used in training and
testing for evaluating the BR database

Digit    Training Set    Testing Set
0            271             312
1            259             405
2            675            1145
3            606             943
4            580             970
5            489             901
6            592            1070
7            443             750
8            454             834
9            414             643
Total       4783            7973


Table IV—Recognition results in percent using BR speech data for
training and for testing

                    Number of Templates per Word
Digit            3      6     12     20     30     50     75
0             60.5   71.0   70.7   72.8   69.6   72.8   72.8
1             85.6   86.1   85.6   88.0   88.2   88.5   87.7
2             64.2   67.5   74.5   76.8   78.2   81.5   82.6
3             74.2   84.0   87.5   89.6   91.3   89.9   90.4
4             88.4   92.4   93.5   92.6   95.4   94.2   92.9
5             78.9   78.9   86.4   85.0   87.8   89.1   89.5
6             63.3   75.1   79.8   80.9   79.3   87.7   84.9
7             62.8   74.7   80.2   84.9   88.4   88.4   90.1
8             71.4   70.3   75.8   80.1   81.2   83.3   83.3
9             58.8   69.8   63.4   70.9   74.9   75.6   80.6
Average       71.0   77.0   80.5   82.7   84.4   86.1   86.3
String rate   16.3   24.2   29.5   34.1   36.0   42.2   43.5


The recognition system used in all evaluations was the Linear
Predictive Coding (LPC)-based isolated word recognition system
developed and tested extensively at AT&T Bell Laboratories.’” As we
stated in Section 5.2, the MKM clustering algorithm was used to create
several sizes of template sets. Table IV shows the recognition results
for seven different clustering configurations: 3, 6, 12, 20, 30, 50, and
75 clusters per word. Shown are the per-digit accuracy, the average digit
accuracy over all digits, and the string accuracy, where a string is
nominally seven digits long. We see that for all template sizes the digits
zero (or oh) and nine have the highest error rates. The major confusion 
for the digit zero (oh) was the digit four. In the BR and PO studies,
about one-half of the talkers pronounced the word four as /foe/ rather
than /fawr/, and talkers used the word oh instead of zero more than 75
percent of the time. One possible explanation for the confusion is that
the endpoint detector included too much background noise when
determining the beginning point for some pronunciations of the word oh,
so that the word oh resembled /foe/ and was misrecognized.
Alternatively, since the frication at the beginning of the word four 
closely resembles typical background noise encountered in our testing 
environment, it would be easy for the speech endpoint detector to 
misplace the beginning marker for this word, thus totally eliminating 
the fricative sound. Since templates for the word four are created from 
this type of data, a spoken digit oh could be misrecognized. Low 
accuracy for the digit nine was also obtained. A possible reason for 
such low accuracy is that the nasal sound is being masked by the 
various noises on the telephone line. Figure 2 shows a plot of digit 
recognition accuracy as a function of the number of templates (or 
clusters) created for each word. We can see that as the number of 
templates per word increases, the recognition accuracy increases 
asymptotically, with the best accuracy (86.3 percent) occurring with a 
75-template-per-word set. 

Fig. 2—Recognition accuracy for training and testing on BR data as a function of
the number of templates used per word.


The average string length was seven digits. If all single-digit
recognitions were independent of one another, the average string
accuracy would therefore be the average per-digit accuracy raised to
the seventh power. However, for all template set sizes the actual string
accuracy was greater than this theoretical result; that is, the error rate
was not uniform over all talkers nor independent of the talker. A
similar result on another speech database was obtained by Rosenberg
and Shipley.'®
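The comparison between the independence prediction and the measured string rate can be reproduced from the "Average" and "String rate" rows of Table IV. The sketch below is a minimal illustration; the list and function names are hypothetical, and the numbers are copied from the table.

```python
# Rows of Table IV (percent), indexed by number of templates per word.
templates_per_word = [3, 6, 12, 20, 30, 50, 75]
avg_digit_accuracy = [71.0, 77.0, 80.5, 82.7, 84.4, 86.1, 86.3]
observed_string_rate = [16.3, 24.2, 29.5, 34.1, 36.0, 42.2, 43.5]

STRING_LENGTH = 7  # nominal number of digits per string


def independent_string_accuracy(digit_accuracy_percent: float,
                                length: int = STRING_LENGTH) -> float:
    """String accuracy (percent) if every digit were recognized independently."""
    return 100.0 * (digit_accuracy_percent / 100.0) ** length


predicted = [independent_string_accuracy(a) for a in avg_digit_accuracy]

for k, p, o in zip(templates_per_word, predicted, observed_string_rate):
    print(f"{k:2d} templates: predicted {p:5.1f}, observed {o:5.1f}")
```

For every template set size the observed string rate exceeds the seventh-power prediction (for example, about 30.5 percent predicted versus 36.0 percent observed with 30 templates per word), which is what one would expect if errors cluster on particular talkers.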

Figure 3 demonstrates the effects of imposing a rejection threshold 
on the recognition system. A recognition distance score above the 
threshold would result in a no decision choice by the recognizer. Shown 
in this plot is the percent of no decisions versus the percent of error 
rate. We see that if only a 1-percent error rate could be tolerated by a 
task using this recognizer under these recording conditions, then a 60- 
percent no decision rate must also be accepted. However, a 10-percent 
probability of error was attained with only a 9-percent no decision 
rate. 
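The trade-off between no-decision rate and error rate can be computed from per-token distance scores as sketched below. The data format, the function name, and the choice to express both rates as percentages of all tokens are assumptions made for this illustration; the scores are made-up values, not results from the BR database.

```python
from typing import List, Tuple


def rejection_tradeoff(
    results: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (no_decision_percent, error_percent) for a distance threshold.

    results holds one (distance, is_correct) pair per test token, where
    distance is the score of the best-matching template. A token whose
    distance exceeds the threshold produces no decision.
    """
    if not results:
        raise ValueError("results must not be empty")
    total = len(results)
    rejected = sum(1 for distance, _ in results if distance > threshold)
    errors = sum(
        1 for distance, is_correct in results
        if distance <= threshold and not is_correct
    )
    return 100.0 * rejected / total, 100.0 * errors / total


# Made-up scores: incorrect recognitions tend to have larger distances.
toy_results = [
    (0.20, True), (0.25, True), (0.30, True), (0.35, False), (0.40, True),
    (0.50, False), (0.60, False), (0.30, True), (0.28, True), (0.45, True),
]

for threshold in (1.00, 0.45, 0.33):
    no_decision, error = rejection_tradeoff(toy_results, threshold)
    print(f"threshold {threshold:.2f}: {no_decision:.0f}% no decision, {error:.0f}% error")
```

Lowering the threshold trades errors for no decisions, which is the behavior plotted in Figure 3.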

If we compare these results (using a 30-template-per-word solution 
for comparison) with those obtained from the PO database study, the 
results from the BR study seem to be worse (84.4 percent for BR 
versus 93.1 percent for PO). However, in the PO study 50 percent of 
the speech data available for testing was eliminated from the database 
because of noise conditions, connected rather than isolated input, and 
hardware failures. Also, the automatic endpoint detector™ was over- 
ruled by human intervention about 50 percent of the time.® In contrast, 
in the BR study all the data available were used and the words were
detected automatically before recognition was performed.” For this
reason it is felt that the BR results are very encouraging.

Fig. 3—Plot showing recognition error rate versus no decision choice—training and
testing data from BR.


6.2 Connected word recognition results 


Recognition was performed on the 980-call subset of the processable
calls that contained some connected digit sequences, using the
level-building Dynamic Time Warp (DTW) algorithm of Myers et al.!”
Testing was carried out with and without augmenting the Itakura log 
likelihood distance with an energy distance as described in Rabiner.’® 
The template set used was the 30-template-per-word set created from 
isolated tokens from the BR database. No embedded training, as 
described in Rabiner et al.’®, was used. 

Since there was not an abundance of connected digit strings within
the 980 calls (only 1790), we simply present the results and do not
make any categorical remarks on connected digit input over noisy
channels. Table V shows the results of these experiments. The results
indicate that using energy information in the distance computation
improved the string accuracy over all string lengths from 45.8 to 61.0
percent if the string length is unknown, and from 63.6 to 66.8 percent
if the string length is known. It is expected that the use of embedded
training would greatly improve these results.

Table V—Recognition results in percent from connected digit
sequences from BR speech data

                                  Percent Correct Recognition
                               Without Energy       With Energy
No. of Digits     No. of       Known    Unknown    Known    Unknown
in String       Occurrences    Length   Length     Length   Length
     2             1441         68.5     49.4       71.1     65.3
     3              277         46.6     33.6       51.6     45.9
     4               67         32.8     22.4       38.8     32.8
     5                5         20.0      0.0       60.0     40.0
   Total           1790         63.6     45.8       66.8     61.0
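The "Total" row of Table V is the average of the per-length accuracies weighted by the number of occurrences of each string length. The sketch below recomputes it from the other rows; the dictionary layout and function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Table V rows: digits in string -> (occurrences, known/no energy,
# unknown/no energy, known/with energy, unknown/with energy), in percent.
table_v: Dict[int, Tuple[int, float, float, float, float]] = {
    2: (1441, 68.5, 49.4, 71.1, 65.3),
    3: (277, 46.6, 33.6, 51.6, 45.9),
    4: (67, 32.8, 22.4, 38.8, 32.8),
    5: (5, 20.0, 0.0, 60.0, 40.0),
}


def weighted_totals(rows: Dict[int, Tuple[int, float, float, float, float]]
                    ) -> Tuple[int, List[float]]:
    """Return total occurrences and the occurrence-weighted accuracy per column."""
    total = sum(row[0] for row in rows.values())
    sums = [0.0, 0.0, 0.0, 0.0]
    for count, *accuracies in rows.values():
        for column, accuracy in enumerate(accuracies):
            sums[column] += count * accuracy
    return total, [round(s / total, 1) for s in sums]


total_strings, totals = weighted_totals(table_v)
print(total_strings, totals)
```

Because two-digit strings make up most of the 1790 occurrences, the totals are dominated by the two-digit row.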


6.3 A syntax-directed recognition system based on isolated digit input 


The results described previously assume that all single digit recog- 
nitions are independent of each other. But in fact that is not the case 
for this database, as customers were asked to speak their seven-digit 
telephone number. For this well-defined task there is some syntactic 
information that can be used to help guide the recognition system. For 
example, the first three digits of the seven-digit input define the local 
exchange. In general, significantly fewer than the 1000 possible
exchanges are in use within an area-code region. However, the last four digits
are usually distributed uniformly over the 10,000 possible sequences. 

A recognition system was assembled to make use of the syntactic 
structure of telephone numbers. First, the database was searched to 
find all valid exchanges. This yielded a total of 86 valid exchanges out 
of a possible 1000. The recognition system was then programmed to 
do the following task. For each customer, digit recognition was per- 
formed on the exchange. The output of the recognition system was a 
set of similarity scores’ for each digit, for all digits in the exchange. 
Next, the customer’s actual utterances for the exchange were tagged 
as being that valid exchange with the lowest total distance (i.e., the 
sum of the individual digit scores). The utterances were then converted 
into speaker-dependent templates and added to the previously created 
speaker-independent template set. This new template set was then 
used to recognize the last four digits in the telephone number. This 
procedure was done on a per-talker basis. If in the last four digits the 
customer spoke any of the digits that were in the exchange, having a 
template of those words created by the user should increase the 
probability that the recognizer would correctly recognize those words. 
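The exchange-selection step described above can be sketched as a search over the list of valid exchanges for the one with the lowest summed digit distance. The data layout, the function name, and the example scores below are assumptions for illustration; the creation of speaker-dependent templates from the tagged utterances is not shown.

```python
from typing import Dict, Iterable, List, Tuple


def best_valid_exchange(
    digit_distances: List[Dict[str, float]],
    valid_exchanges: Iterable[str],
) -> Tuple[str, float]:
    """Pick the valid exchange with the lowest total distance.

    digit_distances[i][d] is the recognizer's distance score between the
    i-th spoken digit and vocabulary digit d (lower is better).
    """
    best = None
    for exchange in valid_exchanges:
        if len(exchange) != len(digit_distances):
            continue  # skip entries that do not match the number of spoken digits
        total = sum(digit_distances[i][d] for i, d in enumerate(exchange))
        if best is None or total < best[1]:
            best = (exchange, total)
    if best is None:
        raise ValueError("no valid exchange matches the utterance length")
    return best


# Made-up scores for a three-digit exchange; unlisted digits score 0.60.
base = {str(d): 0.60 for d in range(10)}
scores = [
    {**base, "5": 0.30, "9": 0.33},
    {**base, "8": 0.28, "4": 0.40},
    {**base, "3": 0.29, "2": 0.31, "9": 0.50},
]
valid = ["582", "949", "766"]  # hypothetical list; the BR data yielded 86 valid exchanges

exchange, distance = best_valid_exchange(scores, valid)
print(exchange, round(distance, 2))
```

In this toy case the digit-by-digit best guess would be 583, which is not a valid exchange, so the syntax constraint selects 582 instead.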

Without syntax, the recognizer (using a 30-template-per-word reference
set) yielded a digit accuracy of 84.4 percent. When the above system was
implemented, the digit accuracy increased to 87.2 percent. The string
accuracy without syntax, as tested on 984 seven-digit strings, was 38.2
percent. This increased to 47.8 percent when syntax was added.

Fig. 4—Recognition accuracy on the exchange string (three digits) from the BR
database as a function of candidate position (solid line is with syntax and dashed line
is without syntax).

Figure 4 shows a plot of exchange recognition accuracy as a function 
of whether the correct exchange was within the top ten candidates. 
The dashed line shows the results when no syntax is used, and the 
solid line shows the results when syntax is used. We see that the 
correct exchange (all three digits) has been recognized correctly in the 
system with no syntax 64 percent of the time and 87 percent within 
the top five candidates, whereas in the syntax-directed system the 
results are 83 percent and 98 percent, respectively. Figure 5 shows a 
plot of the number of times the correct exchange is within a distance
Δ from the minimum possible exchange score (over the 1000 possible
exchanges). For the Itakura log-likelihood ratio distance the mean
distance for a correct recognition is about 0.30 and for an incorrect
recognition about 0.45.28 We see that within a distance of 0.25 from
the minimum the correct exchange (three digits) is always present.
Figure 6 shows the average number of possible exchanges within a Δ
region over all strings. It shows that using syntax greatly reduces the
number of recognition candidates (e.g., from 80 to 10 for Δ = 0.20).

Fig. 5—Recognition accuracy on the exchange string (three digits) as a function of
whether the correct exchange is within a Δ distance from the minimum possible exchange
score.

Fig. 6—Plot showing the average number of exchange candidates within a Δ region.


Figure 7 shows a plot of the string accuracy of the last four digits as
a function of whether the correct string was within the top ten
candidates. The dashed line shows the results without using the
additional templates generated by applying the syntactic rules to the
first three digits. The solid line shows the recognition results when
the original speaker-independent template set was augmented with
the speaker-dependent templates determined through the syntactic
rules on the first three digits. With syntax the string accuracy only
improved from 53.3 to 54.5 percent and was within the top five
candidates 78 percent of the time.

Fig. 7—Recognition accuracy on the last four digits from the BR database as a
function of candidate position. The dashed line indicates no syntax was used and the
solid line indicates syntax was used.

These results show that adding task information improved the 
overall system recognition accuracy. Augmenting the template set with 
the extra exchange utterance templates slightly improved performance 
on the last four digits. However, the main contribution of the syntax 
was to recognize the exchange more accurately. 


VII. ROBUSTNESS OF SPEAKER-INDEPENDENT TEMPLATES 


One of the goals of this study was to examine how robust
speaker-independent templates, created using one population of talkers
under one set of transmission conditions, are when applied to different
populations and transmission conditions. An experiment was carried out
in which template sets created from each of the three regional databases
(Portland, Baton Rouge, and Murray Hill) were tested on speech data from
all three databases.

Initially, a template set was created from a Murray Hill speech 
database that consisted of 100 talkers, 50 male and 50 female. The 
data were collected under laboratory conditions over local Private 
Branch Exchanges (PBXs). A clustering analysis was performed and 
a set of 12 speaker-independent templates per word was created. This 
template set has been tested extensively in other experiments (see 
Refs. 2, 8, 11, 14, 16, and 17). For testing purposes, another group of 
100 talkers (disjoint from the training population) each provided one 
replication of the digits vocabulary. 

The template set used to represent Portland data was a 30-template- 
per-word set created from the “cleanest” speech obtained in the PO 
study.® For testing the entire 11,035-digit database was used. For 
comparison purposes a 30-template-per-word reference set was used 
to model the Baton Rouge database. For testing purposes the entire 
7973-digit testing set was used. 

Table VI shows the results for all cross recognition tests. The symbol 
<AVG> stands for the averaging of recognition results over all three 
databases given a particular training or testing dataset. In order not 
to distort the averages (since each database had a different number of 
tokens), a simple nonweighted averaging was performed. It is felt that 
there was sufficient data in each regional database to make this result 
meaningful. 

Table VI—Cross template and testing set recognition accuracy

                                Recognition Accuracy
Training Set    Testing Set         in Percent
    BR              BR                 84.4
    BR              PO                 85.8
    BR              MH                 92.3
    PO              PO                 93.1
    PO              BR                 76.8
    PO              MH                 91.6
    MH              MH                 98.4
    MH              PO                 77.4
    MH              BR                 62.3
    BR             <AVG>               87.4
    PO             <AVG>               87.1
    MH             <AVG>               79.3
   <AVG>*           BR                 74.3
   <AVG>            PO                 85.4
   <AVG>            MH                 94.1

* <AVG> = averaged over all three databases

Fig. 8—Recognition results for each regional template set as a function of the testing
set.


Figure 8 (a graphical form of Table VI) shows the results when
testing each of the template sets against each of the testing datasets,
where μ is the average recognition accuracy over all three testing sets
given a training set. Shown is recognition accuracy as a function of
which testing set was used. This figure shows that the best recognition
results for each of the test sets occurred, not surprisingly, using the 
template set also created from data in the same region. The recognition 
performance was best with MH data, then with PO data, and last with 
BR data. Also, notice the greater variation in recognition accuracies 
as the self-recognition scores decline. For example, the PO and BR 
templates performed about the same against MH testing data and 
about 7 percent worse than MH templates, whereas the PO and MH 
template sets performed, respectively, 8 and 21 percent worse than the 
BR template set when tested with BR data. 

In Fig. 9, the results are shown as a function of individual digits. As 
was the case in the earlier PO study, we see that the MH templates 
do not adequately represent the speaking style or noise conditions 
present in the PO or BR testing data. For most of the digits in the BR 
and PO testing population the templates from PO and BR yielded 
much better results. 

Fig. 9—Recognition results for each regional template set as a function of testing set,
on a per-digit basis.

Fig. 10—Recognition accuracy for each regional testing set as a function of the
template set used.


Figure 10 shows the results in a different context. Shown is
recognition accuracy as a function of the template set used. Interestingly,
the BR template set performed the best over all three testing
conditions. (Even though nearly the same average accuracy was obtained
with the BR and PO templates, the BR templates yielded a standard deviation
of 4.3 percent compared with 9.1 percent for the PO template set.)
The MH template set performed significantly worse than either the
BR or PO template sets, with an average recognition accuracy of 79.3
percent and a standard deviation of 18.1 percent.
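The per-template-set averages and standard deviations can be recomputed from the nine cross-recognition accuracies in Table VI. The sketch below assumes an unweighted mean and the sample standard deviation (divisor n - 1), since that choice reproduces the quoted values to within about 0.1; the dictionary layout is hypothetical.

```python
from statistics import mean, stdev

# Table VI accuracies in percent: accuracy[training_set][testing_set].
accuracy = {
    "BR": {"BR": 84.4, "PO": 85.8, "MH": 92.3},
    "PO": {"BR": 76.8, "PO": 93.1, "MH": 91.6},
    "MH": {"BR": 62.3, "PO": 77.4, "MH": 98.4},
}

summary = {}
for training_set, by_testing_set in accuracy.items():
    values = list(by_testing_set.values())
    # Unweighted average over the three testing sets, and sample standard deviation.
    summary[training_set] = (mean(values), stdev(values))

for training_set, (average, deviation) in summary.items():
    print(f"{training_set}: mean {average:.1f}, standard deviation {deviation:.1f}")
```

The small residual differences from the published figures (for example, 87.5 here versus 87.4 for the BR template set) presumably reflect rounding of the individual entries in Table VI.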

Figure 11 shows the per-digit recognition results for each testing 
data set given a particular template set. For most digits in the MH 
testing set, the template sets created from BR and PO data yielded as 
good a recognition accuracy as did the templates created from MH 
data. However, again we see the converse not to be true, that is, the 
MH templates yielded significantly poorer recognition results when 
tested against PO and BR testing data than did template sets created 
from those regions. 

Fig. 11—Recognition accuracy for each regional testing set as a function of the
template set used, on a per-digit basis.


Tables VII through IX show confusion matrices generated from
each of the above recognition experiments. Shown are only those
confusions that occurred more than 3 percent of the time. In Table
VII, results are shown for each testing set when recognition was
performed using the BR template set. In each of the BR and PO
testing sets the biggest error was the spoken word oh being confused
for four. Possible explanations for the confusions have been given in
Section 6.1. This problem did not occur in the MH data as all talkers
used the word zero. Notice that the reverse confusion (i.e., four
misrecognized as zero or oh) did not occur.


Table VII—Confusion matrix for each testing set when recognition
was performed using the BR template set

(a) BR Testing Data
0  69.6 15.6 5.1
1  88.2 5.3
2  4.6 78.2 4.7 5.4
3  91.3 3.0
4  95.4
5  87.8 6.9
6  4.2 79.3 11.6
7  88.4
8  7.5 5.6 81.2
9  3.1 12.8 4.6 74.9

(b) PO Testing Data
0  66.3 27.7
1  94.2
2  8.1 17.8 8.1 4.5
3  95.7
4  3.3 93.6
5  6.8 84.0 5.5
6  82.4 6.8 5.1
7  85.5 5.0
8  5.8 90.3
9  4.4 4.1 8.9 79.0

(c) MH Testing Data
0  100.0
1  98.0
2  99.0
3
4  100.0
5  4.0 5.0 75.0 4.0 11.0
6  3.0 94.0
7  10.0 87.0
8  3.0 3.0 3.0 4.0 87.0
9  9.0 5.0 84.0

Table VIII shows the confusion matrices when using the template 
set generated from MH data. The only confusion when tested against 
MH testing data was six versus seven. Notice when testing against 
BR data that all digits are misrecognized as seven a large percent of 
the time. For this testing data there are many major confusions. 

Table VIII—Confusion matrix using template set generated from MH
data

                Recognized Digit
Spoken Digit    0 1 2 3 4 5 6 7 8 9

(a) BR Testing Data
0  58.4 11.8 4.77 11.8 6.5
1  7.8 46.3 6.4 20.1 4.0 8.3
2  47.9 3.3 70 19.4 19.9
3  74.2 7.0 5.4 7.5
4  28.6 3.1 50.6 8.5 3.8
5  3.3 79.6 7.4 4.2 3.9
6  4.0 62.6 5.0 24.9
7  4.6 3.0 12.3 69.4 5.1 3.0
8  14 4.7 79.2
9  26.7 14.7 10.2 43.2

(b) PO Testing Data
0  84.3 5.6 3.3
1  6.3 72.9 7.8 4.3 6.3
2  86.4 6.0
3  87.7 5.4
4  34.0 48.7 8.8
5  3.2 80.4 5.6 6.5
6  85.8 7.5
7  3.2 9.2 74.8 3.0 3.2
8  95.3
9  4.4 126 12.3 3.9 64.3

(c) MH Testing Data
0  100.0
1  100.0
2  100.0
3  100.0
4  99.0
5  97.0
6  91.0 6.0
7  100.0
8  98.0
9  99.0


Table IX shows the confusions generated from training with PO
data. As with the other template sets the five-nine confusion is
prominent. Also, the MH testing set produced a large confusion
between the digits nine and one. Examining Tables VII through IX,
we see that the template set generated from BR speech data yielded
fewer major confusions (over all three test sets) than either the PO
or MH template sets.

Summarizing this experiment, the template set created from a subset 
of BR speech data was quite robust over different populations and 
noise conditions, yielding an average recognition accuracy of 87.4 
percent. Additionally, the MH template set, which was created under 
laboratory conditions, provided poor recognition results when tested 
under “real-world” recording conditions.

In a final experiment the template sets created from the PO, BR, 
and MH data were combined together to form one large template set 
with 72 templates per word (i.e., 30 templates per word from each of 
the PO and BR sets, and 12 templates per word from the MH template 
set). The testing set for this experiment was the combined testing sets 
from PO, BR, and MH.

Table IX—Confusion matrix generated from training with PO data

                Recognized Digit
Spoken Digit    0 1 2 3 4 5 6 7 8 9

(a) BR Testing Data
0  63.8 12.5 5.7 3.6 6.8
1  76.2 13.4 5.6
2  66.8 5.4 12.7 4.2 8.0
3  84.6 3.8
4  5.0 3.4 88.5
5  87.7 4.7 3.1
6  5.8 72.1 3.5 15.5
7  5.8 84.9
8  10.0 7.9 5.6 72.5
9  19.4 12.5 60.1

(b) PO Testing Data
0  90.9
1  95.2
2  95.0
3  94.2
4  3.8 93.4
5  93.1 3.0
6  87.0 3.2 4.2
7  93.3
8  95.3
9  9.2 85.0

(c) MH Testing Data
0  87.0 9.0 4.0
1  98.0
2  98.0
3  98.0
4  3.0 97.0
5  8.0 72.0 3.0 14.0
6  4.0 91.0 5.0
7  7.0 91.0
8  3.0 94.0
9  7.0 90.0

A recognition accuracy of 90.9 percent was achieved under these
conditions. A histogram of template usage showed that several
templates were used more often for incorrect recognitions than for correct
recognitions. A test was carried out in which template sets were created
as subsets of the full 72-template-per-word set. These subsets were
chosen such that the Net Percent Correct Recognition (NPCR) per
template, as defined over all three testing sets (i.e., the percentage of
the time that the template was used for a correct score minus the
percentage of the time it was used incorrectly), was greater than a
threshold. Figure 12 shows a plot of the total number of templates used
(dashed line) and the recognition accuracy (solid line) as a function of
the threshold. As indicated by the results, 20 templates of the original
720-template set yielded only incorrect recognitions. Also, most
templates yielded an NPCR of 50 percent. As we look for an NPCR of
greater than 50 percent, the number of templates that qualify goes down
exponentially. Only 144 of the original 720 templates yielded an NPCR
of 100 percent.

The recognition curve (solid line) shows a similar shape. We see 
that recognition accuracy stays constant for NPCRs of less than 80 
percent, then falls rapidly to 68 percent for an NPCR of 100 percent. 
These two curves show that the total number of templates can be 
reduced by 22 percent from 720 to 560 (or an average of 56 templates 
per word) without reducing the overall recognition accuracy. 
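The template-pruning test can be sketched as follows, using the NPCR definition given above (percentage of correct uses minus percentage of incorrect uses). The usage counts, template names, and the decision to drop templates that were never used are all assumptions made for this illustration, not details from the experiment.

```python
from typing import Dict, List, Optional, Tuple


def npcr(correct_uses: int, incorrect_uses: int) -> Optional[float]:
    """Net Percent Correct Recognition for one template.

    Returns None when the template was never selected, since the
    percentage is then undefined.
    """
    total = correct_uses + incorrect_uses
    if total == 0:
        return None
    return 100.0 * (correct_uses - incorrect_uses) / total


def prune_templates(
    usage: Dict[str, Tuple[int, int]], threshold: float
) -> List[str]:
    """Keep the templates whose NPCR is strictly greater than threshold."""
    kept = []
    for name, (correct_uses, incorrect_uses) in usage.items():
        score = npcr(correct_uses, incorrect_uses)
        if score is not None and score > threshold:
            kept.append(name)
    return kept


# Hypothetical usage histogram: template name -> (correct uses, incorrect uses).
usage = {
    "four_01": (90, 10),   # NPCR  80
    "oh_07": (0, 12),      # NPCR -100: only incorrect recognitions
    "nine_03": (30, 10),   # NPCR  50
    "two_11": (25, 0),     # NPCR 100
    "six_05": (0, 0),      # never used
}

for threshold in (0.0, 50.0, 80.0):
    print(threshold, prune_templates(usage, threshold))
```

Raising the threshold shrinks the retained set, which is the behavior shown by the template-usage curve in Figure 12.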

Table X shows a confusion matrix for the combined template set 
(only entries greater than 3 percent are shown). The results indicate 
that this template set yielded fewer confusions than did any of the
individual template sets, with the major confusions being
zero (oh)-four, nine-five, and six-eight.


Fig. 12—Plot showing recognition accuracy and number of templates used as a
function of template use threshold.


Table X—Confusion matrix—combined template set versus all
testing data

                Recognized Digit
Spoken Digit    0 1 2 3 4 5 6 7 8 9
0  86.5 5.6
1  94.5
2  90.7
3  93.9
4  94.4
5  92.5 3.4
6  85.7 7.1
7
8  3.3 90.5
9  7.5 84.2


VIII. DISCUSSION


The results described in earlier sections show that:

1. Based on the collection of speech data in Portland and Baton
Rouge, it is clear that significant problems exist in prompting casual
telephone customers to speak digit strings in an isolated format.

2. One problem that existed in the Portland study was the inability
to detect words automatically in nonideal environments. With the use
of the top-down endpoint detection algorithm,’ this problem was
greatly reduced in the Baton Rouge tests.

3. The recognition results obtained from the BR tests were worse
than those obtained in the Portland study (84.4 and 93.1 percent,
respectively, on comparable sized reference sets). Since in the Portland
experiment 50 percent of all data was eliminated from testing and 50
percent of the remaining data needed human interaction to correct
endpoint failures, we feel the results from the Baton Rouge study,
which eliminated no data and did not allow for endpoint corrections,
more accurately demonstrate our current capabilities.

4. The addition of syntactic constraints on the isolated word
recognizer's output increased the overall recognition system accuracy—from
84.4- to 87.2-percent digit accuracy and from 38.2- to 47.8-percent
string accuracy (i.e., seven-digit telephone number).

5. The template set created from a subset of BR data is quite robust
over different populations and noise conditions, averaging 87.4 percent
over the three regional data sets. By creating a combined template set
based on the templates generated from speech data from Portland,
Baton Rouge, and Murray Hill, a recognition accuracy of 91 percent
was obtained when tested on 20,000 tokens of PO, BR, and MH data.

These results are very encouraging, as they indicate that regional 
“speaker-independent” template sets may not be required to obtain 
the highest recognition accuracy possible over all regions. However, 
since for each regional database the best recognition scores occurred 
using training and testing data from that region, having regional 
templates will improve accuracies in the individual regions. 
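
As a rough way to read the digit and string accuracies quoted in item 4, the sketch below computes the string accuracy that would result if the seven digit decisions in a string were statistically independent. Independence is an assumption made here for illustration only; it is not a claim of the study, and the function name is hypothetical.

```python
# Illustrative check, not from the paper: predicted string accuracy if the
# seven digit decisions in a string were statistically independent.

DIGITS_PER_STRING = 7  # seven-digit telephone number, as in the text


def string_accuracy_if_independent(digit_accuracy, n_digits=DIGITS_PER_STRING):
    """Probability that all n_digits are correct under the independence assumption."""
    return digit_accuracy ** n_digits


# (label, reported digit accuracy, reported string accuracy), taken from item 4
reported = [
    ("without syntax", 0.844, 0.382),
    ("with syntax", 0.872, 0.478),
]

for label, digit_acc, string_acc in reported:
    predicted = string_accuracy_if_independent(digit_acc)
    print(f"{label}: predicted {predicted:.3f}, reported {string_acc:.3f}")
```

Under this assumption the predicted string accuracies are roughly 30.5 and 38.3 percent, both below the reported 38.2 and 47.8 percent. That gap would be consistent with digit errors clustering in a subset of calls rather than falling independently, although the data given here do not establish the cause.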

To compute the end-to-end recognition system performance number, 
several intermediate results must be combined. Figure 
13 shows the combination of all steps in the recognition system. 
Starting with all calls that were handled by our recognition system, 
initially 20 percent abandoned the transaction. Of the 80 percent of 
calls remaining, 52 percent required some form of operator assistance 
to complete the call and 17 percent contained some connected input. 
Fig. 13—End-to-end recognition system performance from isolated digit input. 


This left only 24 percent of the calls consisting solely of isolated digit 
input. Therefore, we can see that the end-to-end system performance 
was 21.3 percent on digits and 11 percent on strings (where a string 
consisted nominally of seven isolated digits). The full recognition 
system was able to handle automatically 11 percent of all calls received. 
If such a system were to be implemented, hopefully over a period of 
time customers would learn the required task. This would greatly 
reduce the number of transactions needing manual assistance and the 
number of calls containing connected input. Once connected digit 
recognition has achieved the same performance as isolated speech, the 
restriction of isolated input can be relaxed. These improvements 
should greatly increase end-to-end system performance. 
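
The combination of steps described above can be tallied with a short sketch. It assumes that the 52 and 17 percent figures are fractions of the calls remaining after abandonment and that the two groups do not overlap. It also assumes that the syntax-constrained accuracies of 87.2 percent (digits) and 47.8 percent (strings) are the ones applied to the isolated-digit calls. The function names are hypothetical.

```python
# Illustrative tally of the end-to-end performance; percentages are taken from
# the text, and the way they are combined is an assumption stated above.


def isolated_only_fraction(abandoned, operator_assisted, connected_input):
    """Fraction of all calls that consist solely of isolated digit input."""
    remaining = 1.0 - abandoned
    # operator_assisted and connected_input are fractions of the remaining
    # calls and are assumed not to overlap.
    return remaining * (1.0 - operator_assisted - connected_input)


def end_to_end_rate(fraction_isolated, recognition_accuracy):
    """Fraction of all calls handled correctly by the recognizer."""
    return fraction_isolated * recognition_accuracy


frac = isolated_only_fraction(abandoned=0.20, operator_assisted=0.52,
                              connected_input=0.17)
digit_rate = end_to_end_rate(frac, 0.872)   # digit accuracy with syntax
string_rate = end_to_end_rate(frac, 0.478)  # string accuracy with syntax

print(f"isolated-only calls: {frac:.3f}")
print(f"end-to-end, digits:  {digit_rate:.3f}")
print(f"end-to-end, strings: {string_rate:.3f}")
```

With these inputs the tally gives about 24.8 percent isolated-only calls, 21.6 percent on digits, and 11.9 percent on strings. These are close to, but not identical with, the 24, 21.3, and 11 percent reported, presumably because Fig. 13 was computed from unrounded intermediate values.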


IX. SUMMARY 


Results have been presented from a series of speech recognition 
experiments on a speech database obtained from 7373 telephone 
customers speaking in an actual telephone environment in Baton 
Rouge, Louisiana. The best performance of 86.3-percent correct digit 
recognition was obtained when a set of speaker-independent templates 
was created from a subset of the data and tested on the remaining 
data. 

We described a syntax-directed recognition system that incorporates 
information about a seven-digit telephone number task. System ac- 
curacy was shown to improve by 9.4 percent. 

Finally, a series of recognition tests was performed to quantify the 
robustness of speaker-independent templates created under one set of 
recording conditions and tested under another. Additionally, a template 
set was created from a subset of each of the regional template 
sets. A recognition accuracy of 91 percent was obtained when tested 
against 20,000 isolated tokens from PO, BR, and MH data. 